Sun Yat-sen University Licenses Multimodal Talking Face Video Generation Patent for RMB 10,000 to Guangdong Hengsheng Bio-Medical Technology Co., Ltd.

Jan 29, 2026 08:00 CST Updated 08:00

Sun Yat-sen University recently published a public notice of a proposed non-exclusive patent license. Under the notice, the patented invention "A Method and Apparatus for Generating Talking Face Videos Based on Multimodal Information Control" would be licensed to Guangdong Hengsheng Biomedical Technology Co., Ltd. for implementation on a three-year, non-exclusive basis, for a licensing fee of ¥10,000. The inventors of the patent are Professor Wang Haodi of Sun Yat-sen University and his team.

The patent proposed for commercialization covers a talking-face video generation technology in the field of computer vision. Its key innovation is to overcome a limitation of earlier technologies, which overlook facial emotional detail, by fusing multimodal information from vision, audio, and text to achieve more natural, high-fidelity, and emotionally controllable video generation. Technically, it is built on a diffusion model framework. It first encodes the input image and adds Gaussian noise, then extracts identity and mask features from a face reference image, lip-sync control features from the input audio, and emotion features from the input emotion text. Joint embedding and a ControlNet are used to generate emotion-control features, after which denoising, decoding, and video frame interpolation are performed under the joint guidance of the multimodal features. The result is talking-face video with audio-visual synchronization, high identity fidelity, and distinct facial expressions, suited to a range of application scenarios.

Technical Functional Deficiency: The Urgent Need for Talking Face Video Generation

The “disease” discussed here is a technical one: a set of defects inherent in the practical application of talking-face video generation. Existing technologies find it difficult to generate talking-face videos that are natural, high-fidelity, emotionally controllable, and audio-visually synchronized. These shortcomings prevent them from meeting the demand for high-quality interactive experiences across multiple domains, and amount to a “functional deficiency” at the level of technology application.

Talking-face video generation, a key application in computer vision, leverages inputs such as audio and images to reconstruct lip movements and facial expressions during speech, thereby producing videos with high visual fidelity and precise audio-visual synchronization. This technology is widely applied across entertainment, education, business, and human-computer interaction scenarios, serving as a critical enabler for enhancing virtual interaction experiences.

Current mainstream technical solutions in the industry are primarily categorized into three types:

First, methods based on Generative Adversarial Networks (GANs) generate images through adversarial training of a generator and a discriminator, often using facial landmarks or 3D models as an intermediate representation;

Second, methods based on Neural Radiance Fields (NeRF) use neural networks to model a 3D radiance field, reconstructing the 3D scene from images captured at different viewpoints;

Third, diffusion model (DM)-based methods, which have emerged in recent years, generate complex facial images from random noise by gradually reducing the noise level, mitigating the training collapse that affects GANs.

However, all three existing approaches suffer from significant technical limitations that act as “bottlenecks” hindering their practical deployment. For GAN-based methods, the introduction of intermediate representations tends to cause semantic information loss, while poor training stability and high complexity make it difficult to ensure consistent generation quality. NeRF-based approaches exhibit noticeable jitter in generated video frames, failing to meet the demand for smooth and natural visual experiences. Although diffusion models offer advantages in image quality and training stability, most studies focus primarily on visual fidelity and audio-visual synchronization, largely overlooking the representation of facial emotional details. This results in generated faces lacking vivid emotional expression, thereby failing to replicate the emotional conveyance process observed in real human speech.

As virtual interaction scenarios continue to diversify, market demands on talking-face video generation technology are becoming increasingly stringent. Generated videos are expected to preserve personal identity with high fidelity, look smooth, and keep lip movements precisely synchronized with the audio. More critically, the core demand is precise control of facial emotions, so that virtual characters can convey complex emotions such as joy, seriousness, and gentleness, thereby enhancing the immersion and authenticity of interactions.

The glaring contradiction between the limitations of existing technologies and the profound demands across diverse scenarios urgently calls for an innovative solution capable of integrating multidimensional information and transcending traditional technical constraints. Such a solution is needed to address issues including deficient emotional expression, insufficient generation stability, and suboptimal visual fluency, thereby promoting the large-scale application of talking face video generation technology across various fields.

Multimodal Fusion + Controllable Emotion: Technological Innovation, Upgrades, and Core Advantages

Compared with existing solutions, this patented technology offers significant technical advantages and core innovations, comprehensively overcoming the technical bottlenecks of traditional talking-face video generation:

First, it constructs a multimodal information fusion framework that improves generation quality across the board. Unlike traditional technologies, which are limited by relying solely on images or on audio, this technology systematically integrates three core modalities (visual, audio, and text). By extracting identity features, face mask features, audio lip-sync features, and emotion text features, it establishes a multi-dimensional collaborative control system. Specifically, the identity and mask features of the visual modality ensure high fidelity of the generated face and precise spatial localization; the audio features keep lip movements closely synchronized with speech; and the text features fill the gap in emotional expression. The deep fusion of these modalities allows the generated videos to combine high visual realism, strong audio-visual synchronization, and rich emotional expression, significantly enhancing the immersion and authenticity of interactions.

Second, it designs a dedicated talking-face reconstruction module that targets the challenges of generation stability and fidelity. This module feeds key auditory and visual features into the denoising process of the diffusion model as external conditions: identity features extracted by a pre-trained image encoder keep the identity of the generated face consistent with the reference image; face mask features, obtained through keypoint detection and encoding, provide precise spatial localization guidance for generation; and encoded audio features give fine-grained control over lip movements through a cross-attention mechanism. This design avoids the semantic loss and training instability caused by reliance on intermediate representations in traditional GAN models, and also addresses the frame jitter seen in NeRF-based techniques, thereby significantly enhancing the stability and fidelity of the generated videos.
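The cross-attention conditioning described above can be illustrated with a minimal sketch. It is not the patented module: the learned query, key, and value projections of a real model are omitted, and the names `latent_tokens`, `audio_keys`, and `audio_values` are hypothetical. It only shows how each latent image token forms a weighted mix of audio condition tokens.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: latent image tokens; keys/values: audio condition tokens.
    Each output row is a convex combination of the value rows.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Hypothetical toy data: two latent tokens attending over two audio tokens.
latent_tokens = [[4.0, 0.0], [0.0, 4.0]]
audio_keys = [[1.0, 0.0], [0.0, 1.0]]
audio_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
conditioned = cross_attention(latent_tokens, audio_keys, audio_values)
```

In this toy setup the first latent token aligns with the first audio key, so its output is dominated by the first audio value; in a real model the same mechanism lets the audio features steer the mouth region of the latent.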

Third, it introduces a facial emotion control module that addresses the industry pain point of missing emotional detail. To overcome the tendency of existing diffusion models to neglect facial emotional expression, the technology aligns emotional semantics with visual content through contrastive learning between emotional text features and secondary character identity features. After the two are jointly embedded into a shared feature space, joint emotional-visual features are generated. A pre-trained ControlNet then converts these into emotion-control features, which are integrated into the denoising process through a cross-attention mechanism, enabling precise control over complex facial emotions such as joy and seriousness. As a result, the generated virtual characters are no longer “emotionless speaking machines”: they replicate the emotional conveyance observed in real human speech, filling a technological gap in the industry.
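As a rough illustration of contrastive alignment between text and visual features, the sketch below uses an InfoNCE-style loss over cosine similarities. The choice of InfoNCE, the temperature value, and the toy feature vectors are assumptions made for illustration; the article does not state which contrastive objective the patent uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(text_feats, visual_feats, temperature=0.1):
    """Mean InfoNCE loss; text_feats[i] and visual_feats[i] form the positive pair."""
    n = len(text_feats)
    loss = 0.0
    for i in range(n):
        logits = [cosine(text_feats[i], visual_feats[j]) / temperature for j in range(n)]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n

# Hypothetical embeddings for two emotions, e.g. "joy" and "seriousness".
text_feats = [[1.0, 0.0], [0.0, 1.0]]
visual_feats = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(text_feats, visual_feats)
misaligned_loss = info_nce(text_feats, visual_feats[::-1])
```

The loss is small when each text feature sits closest to its own visual feature and large when the pairs are swapped, which is the pressure that pulls matching emotion and face features together in a shared space.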

Fourth, it adopts an advanced diffusion model architecture that balances generation quality and training efficiency. With a diffusion model as its core framework, the technology encodes the input image into a latent space, adds Gaussian noise to the latent representation, and then denoises it step by step, avoiding the training instability commonly associated with GANs.
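The noising step can be sketched with the standard DDPM forward process, in which a clean latent is blended with Gaussian noise according to a cumulative schedule. The linear beta schedule, its endpoints, and the three-element toy latent are assumptions for illustration and are not taken from the patent.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (num_steps > 1)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(latent, alpha_bar_t, rng):
    """Return (noisy_latent, noise) with
    noisy = sqrt(alpha_bar_t) * latent + sqrt(1 - alpha_bar_t) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    noisy = [signal_scale * x + noise_scale * e for x, e in zip(latent, noise)]
    return noisy, noise

rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
latent = [0.5, -1.0, 2.0]  # stand-in for an encoded image latent
noisy_early, _ = add_noise(latent, alpha_bars[0], rng)    # almost unchanged
noisy_late, _ = add_noise(latent, alpha_bars[-1], rng)    # almost pure noise
```

Early timesteps keep nearly all of the signal, while the final timestep retains almost none, which is why generation can start from random noise.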

Meanwhile, the denoising module consists of multiple U-Nets that achieve efficient denoising by predicting noise residuals at different timesteps. In conjunction with an image frame interpolator for video synthesis, this approach effectively reduces the computational load and cost of model training while ensuring high-quality generated images. Furthermore, the core logic of this technology demonstrates strong generalizability and scalability; it can be adapted to various datasets and generation tasks with minimal adjustments, offering broad applicability and significant potential for practical implementation.
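To make "predicting noise residuals" concrete, the sketch below inverts the standard DDPM noising relation: given a noisy latent and a predicted noise residual, it estimates the clean latent. The U-Net itself is not modeled; the true noise stands in for a perfect prediction, and all names and numbers are hypothetical.

```python
import math

def predict_clean_latent(noisy, predicted_noise, alpha_bar_t):
    """Invert noisy = sqrt(a) * clean + sqrt(1 - a) * noise, for a = alpha_bar_t."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise_scale * e) / signal_scale for x, e in zip(noisy, predicted_noise)]

clean = [0.5, -1.0, 2.0]
noise = [0.3, -0.2, 0.1]
alpha_bar_t = 0.64  # so sqrt(alpha_bar_t) = 0.8 and sqrt(1 - alpha_bar_t) = 0.6

# Forward noising at this timestep.
noisy = [0.8 * x + 0.6 * e for x, e in zip(clean, noise)]

# With a perfect noise prediction, the clean latent is recovered.
recovered = predict_clean_latent(noisy, noise, alpha_bar_t)
```

In practice the prediction is imperfect, so samplers apply many small denoising steps rather than a single inversion.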

From Technological Breakthroughs to Industrial Empowerment: Competition in Domestic and International Markets and Scenario Implementation

In the field of multimodal talking-face video generation, academia, research institutions, and leading enterprises in China and abroad are working together to accelerate technological breakthroughs on industry pain points such as inadequate emotional expression, insufficient generation stability, and difficulties in multi-character coordination. Each is building core competitiveness through a differentiated technical approach, and together they are driving deployment of the technology across industrial scenarios.

ACTalker (HKUST + Tencent + Tsinghua University): Co-developed by the three institutions and described as the world's first multimodal, Mamba-driven, end-to-end video diffusion framework, ACTalker's core innovation lies in combining a parallel Mamba architecture with a mask-drop strategy. Technically, it adopts a multi-branch design in which each branch independently processes one driving signal, such as audio or visual facial motion, to achieve precise control over specific facial regions. A gating mechanism lets the system switch flexibly between single-signal and multi-signal control. Feature manipulation along the spatial and temporal dimensions via the Mamba architecture keeps the generated video naturally coherent. The mask-drop strategy, in turn, lets each driving signal manage its own facial region independently, avoiding control conflicts and significantly improving the synchronization between the virtual human’s lip movements and speech.
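The interplay of gating and region masks can be sketched in a highly simplified form. This is not ACTalker's implementation: the Mamba layers are not modeled, each region's feature is reduced to a single number, and the function name, region names, and values are hypothetical. It only shows how a gate switches a signal on or off while a mask confines it to its own facial regions.

```python
def fuse_branches(branch_features, region_masks, gates):
    """Combine per-signal branch outputs region by region.

    branch_features[s][r]: value that branch s proposes for region r
    region_masks[s][r]:    1 if signal s controls region r, else 0
    gates[s]:              1.0 to enable signal s, 0.0 to disable it
    """
    num_regions = len(next(iter(branch_features.values())))
    fused = [0.0] * num_regions
    for signal, feats in branch_features.items():
        for r in range(num_regions):
            fused[r] += gates[signal] * region_masks[signal][r] * feats[r]
    return fused

REGIONS = ["mouth", "eyes", "brows"]
features = {"audio": [0.9, 0.4, 0.2], "expression": [0.1, 0.7, 0.5]}
masks = {"audio": [1, 0, 0], "expression": [0, 1, 1]}

# Multi-signal mode: audio drives the mouth, expression drives eyes and brows.
both = fuse_branches(features, masks, {"audio": 1.0, "expression": 1.0})
# Single-signal mode: only the audio branch is active.
audio_only = fuse_branches(features, masks, {"audio": 1.0, "expression": 0.0})
```

Because the masks do not overlap, the two signals never write to the same region, which is the conflict-avoidance property the text attributes to the mask-drop strategy.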

Leveraging Tencent’s ecosystem resources, ACTalker has seen initial deployment in two core application scenarios: gaming and social networking. In gaming, the technology can generate speech animations aligned with character personas directly from multimodal signals, replacing the complex traditional workflow of manual motion capture and 3D modeling and significantly shortening the production cycle for NPC dialogue animations. It has already been piloted in some casual mobile games, markedly improving development efficiency. In virtual social networking, it lets users upload audio and a facial reference image to generate personalized avatar speech videos. Because it supports both single-signal and multi-signal control, it meets young users’ demand for highly synchronized and controllable facial animation. In the future, it is expected to be deeply integrated with avatar features in social products such as Tencent QQ and WeChat, further expanding market coverage.

MultiTalk (open-sourced by Sun Yat-sen University, Meituan, and HKUST): With DiT as its core backbone, MultiTalk is described as the first SOTA model supporting conversational video generation with multiple virtual humans. Its core breakthrough is Label Rotary Position Embedding (L-RoPE). The latent video features are adaptively segmented into distinct person regions, each speaker is assigned an independent label, and the labels are mapped via RoPE into the audio cross-attention layers. This achieves precise alignment between audio and lip movements and resolves the person-audio mismatch that arises with multiple audio stream inputs.
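The idea of binding an audio stream to a person through rotary labels can be sketched with a two-dimensional toy example. This is a simplified illustration, not MultiTalk's implementation: the rotation frequency `theta`, the label values, and the feature vectors are hypothetical. A query and key rotated by the same label keep their original similarity, while different labels rotate them apart.

```python
import math

def rotate(vec, label, theta=0.5):
    """Rotary embedding on a 2-D vector: rotate it by the angle label * theta."""
    angle = label * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return [c * x - s * y, s * x + c * y]

def attention_score(query, query_label, key, key_label):
    """Dot product of a label-rotated video query and a label-rotated audio key."""
    q = rotate(query, query_label)
    k = rotate(key, key_label)
    return q[0] * k[0] + q[1] * k[1]

video_token = [1.0, 0.0]   # latent token from one person's region
audio_token = [1.0, 0.0]   # audio feature from one speaker's stream

same_person = attention_score(video_token, 1, audio_token, 1)
other_person = attention_score(video_token, 1, audio_token, 4)
```

With identical features, the score depends only on the label difference, so an audio stream attends most strongly to the video region that carries its own label.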

The team also designed partial parameter tuning and multi-task training strategies to balance instruction-following ability with visual output quality, so that text prompts can control interactions among characters, objects, and scenes. In terms of the open-source ecosystem and industrial applications, MultiTalk has established an initial promotion ecosystem. For virtual influencers and live-streaming platforms, its multi-character dialogue generation is used to create interactive live streams featuring multiple virtual humans. Whereas traditional live streaming with a single virtual human offers limited interaction formats, MultiTalk supports real-time generation of animated dialogue among multiple virtual humans, enriching content and increasing viewer engagement. It has been piloted in some game live streams and educational science-popularization broadcasts, where user retention is reported to be significantly higher than with single-virtual-human live streaming.

From the perspective of industry development trends, talking-face video generation technology based on multimodal information control is evolving toward real-time processing, multi-character support, and deep scene integration. Core technical competitiveness is increasingly focused on cross-modal alignment accuracy, naturalness of generated outputs, and multi-signal collaborative control capabilities. Meanwhile, collaborative innovation among academia, research institutions, and industry has become a key driver for technological deployment; breakthroughs achieved by universities and research institutes are rapidly commercialized through corporate ecosystem resources. Furthermore, open-source models have further lowered the barrier to entry for adopting these technologies, promoting their widespread application across various fields.

In the future, as virtual interaction scenarios continue to diversify, this technology is poised for wider large-scale adoption across sectors such as gaming, social networking, education, healthcare, and government services. Its integration with industry-specific operations will also deepen, as it evolves from a mere content generation tool into scenario-based solutions. Continued technological iteration and the ongoing expansion of application scenarios should foster more robust technical standards and industrial ecosystems. Balancing technological innovation with regulatory compliance, and making these technologies more inclusive, will remain critical challenges requiring sustained exploration throughout the industry’s development.