InstructDubber: Instruction-based Alignment for Zero-shot Movie Dubbing
ABSTRACT
Movie dubbing seeks to synthesize speech from a given script in a specific voice, while ensuring accurate lip synchronization and emotion-prosody alignment with the character’s visual performance. However, existing alignment approaches based on visual features face two key limitations: (1) they rely on complex, handcrafted visual preprocessing pipelines, including facial landmark detection and feature extraction; and (2) they generalize poorly to unseen visual domains, often resulting in degraded alignment and dubbing quality. To address these issues, we propose InstructDubber, a novel instruction-based alignment dubbing method for robust in-domain and zero-shot movie dubbing. Specifically, we first feed the video, script, and corresponding prompts into a multimodal large language model to generate natural-language dubbing instructions describing the speaking rate and emotional state depicted in the video, which are robust to visual domain variations. Second, we design an instructed duration distilling module that mines discriminative duration cues from the speaking-rate instructions to predict lip-aligned phoneme-level pronunciation durations. Third, for emotion-prosody alignment, we devise an instructed emotion calibrating module, which fine-tunes an LLM-based instruction analyzer using the ground-truth dubbing emotion as supervision and predicts prosody from the calibrated emotion analysis. Finally, the predicted duration and prosody, together with the script, are fed into the audio decoder to generate video-aligned dubbing. Extensive experiments on three major benchmarks demonstrate that InstructDubber outperforms state-of-the-art approaches in both in-domain and zero-shot scenarios. The code is available at https://github.com/ZZDoog/InstructDubber.
MODEL ARCHITECTURE

The main architecture of the proposed InstructDubber. To predict the lip-synchronized phoneme-level duration, the Instructed Duration Distilling module (IDD) mines the duration cues from fine-grained speaking rate instructions. The Instructed Emotion Calibrating module (IEC) fine-tunes a lightweight LLM to analyze the emotion instructions using the emotion entities from ground truth dubbing as supervision, and predicts the dubbing prosody based on the calibrated emotion analysis.
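The pipeline above can be sketched as a simple data flow. All module interfaces and names below are illustrative stand-ins (none come from the released code), reduced to deterministic toy behavior so the instruction → duration/prosody → decoder flow is visible:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instructions:
    speaking_rate: str  # natural-language speaking-rate instruction
    emotion: str        # natural-language emotion instruction

def generate_instructions(video: str, script: str) -> Instructions:
    # Stand-in for the multimodal LLM that writes dubbing instructions.
    return Instructions(
        speaking_rate="slow and measured, consistent pace",
        emotion="calm, slightly concerned",
    )

def distill_durations(instr: Instructions, phonemes: List[str]) -> List[float]:
    # IDD stand-in: map the speaking-rate instruction to per-phoneme durations (seconds).
    base = 0.12 if "slow" in instr.speaking_rate else 0.08
    return [base for _ in phonemes]

def calibrate_prosody(instr: Instructions) -> dict:
    # IEC stand-in: turn the calibrated emotion analysis into prosody controls.
    return {"pitch_shift": 0.0, "energy": 1.0, "emotion": instr.emotion}

def decode_audio(script: str, durations: List[float], prosody: dict) -> float:
    # Audio-decoder stand-in: return the total predicted speech length.
    return sum(durations)

video, script = "clip.mp4", "It doesn't dissolve very much."
phonemes = list("IHTDAHZANTDIHZALV")  # toy phoneme sequence
instr = generate_instructions(video, script)
durations = distill_durations(instr, phonemes)
prosody = calibrate_prosody(instr)
speech_len = decode_audio(script, durations, prosody)
```

The point of the sketch is the interface: duration prediction and prosody prediction consume only natural-language instructions, never raw visual features, which is what makes the pipeline robust to visual domain shift.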

BENCHMARK ANALYSIS
An example from each benchmark: V2C-Animation, Chem, and GRID.
V2C-Animation is derived from real animated films and is characterized by exaggerated prosodic variation and complex visual scenes, making it the most challenging dubbing benchmark. The Chem benchmark is collected from chemistry lecture videos on YouTube and features a fixed camera perspective with moderate variation in speaking rate and emotional expression. The GRID benchmark is a basic, widely used multi-speaker dubbing benchmark recorded in a noise-free studio against a uniform background, with a stable speaking rate and minimal emotional variation.

Thus, in the main paper, we evaluate emotion similarity (EMO-SIM) only on the V2C-Animation and Chem benchmarks. In this demo, we primarily showcase dubbing samples from these two more challenging and discriminative benchmarks.

Due to space limitations and our primary focus on proposing a novel approach for dubbing alignment, we do not include comparisons of voice cloning performance across different models in the main text. Here, we compare the voice cloning performance of our approach with the state-of-the-art dubbing model ProDubber, which also serves as our baseline. We extract speaker embeddings from both the generated dubbing and the ground-truth audio using a GE2E-based speaker encoder and compute the cosine similarity between them. The results are presented in Table S3 of the Appendix. As shown in the table, our method achieves comparable speaker similarity performance to ProDubber in the in-domain dubbing scenario. Moreover, in zero-shot dubbing scenarios, our approach demonstrates superior voice cloning performance due to its enhanced robustness and generalization capability across different visual domains.
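As a concrete illustration of this metric, the sketch below computes cosine similarity between two fixed-size speaker embeddings. The 256-dimensional random vectors merely stand in for real GE2E embeddings, and the `encoder.embed_utterance` call in the comment is a typical but assumed interface, not part of this page:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Speaker-similarity score: cosine between two speaker embeddings.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# In practice both embeddings would come from a GE2E speaker encoder, e.g.
#   emb_gen = encoder.embed_utterance(generated_wav)
#   emb_gt  = encoder.embed_utterance(ground_truth_wav)
# Here we use random 256-d vectors purely to exercise the metric.
rng = np.random.default_rng(0)
emb_gen = rng.normal(size=256)
emb_gt = rng.normal(size=256)

score = cosine_similarity(emb_gen, emb_gt)      # in [-1, 1]
identical = cosine_similarity(emb_gt, emb_gt)   # ~1.0 for identical speakers
```

A score near 1.0 indicates the generated dubbing preserves the target speaker's voice; unrelated random vectors, as above, score near 0.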

EXPERIMENTS
Current SOTA dubbing baselines (all experimental results use the official code or the provided checkpoints):
1) StyleDubber (ACL'24) is a SOTA dubbing model that applies multi-scale style learning at the multimodal phoneme level and the acoustic utterance level.
2) Speaker2Dubber (ACM MM'24) is a SOTA pre-trained dubbing method with a two-stage strategy that learns pronunciation from an additional TTS corpus.
3) DeepDubber (arXiv'25) is a SOTA pre-trained dubbing method with a two-stage strategy that learns pronunciation from an additional TTS corpus.
4) ProDubber (CVPR'25) is a SOTA dubbing model that first learns acoustic modeling from a text-speech corpus and then adapts the prosody to the given video.

Chem Sample: 0G0wCm28Jzc-005
Text Content: It doesn't dissolve very much.
Speaking Rate Instruction: The character's speaking pace in the video is relatively slow and measured, with a clear and deliberate delivery. There are no significant variations in the pace throughout the video, and the character maintains a consistent speaking rate throughout the script. The character's mouth movements are also consistent, with minimal variations in the shape and position of the mouth, which further supports the slow and measured pace of the speech.
Emotion Instruction: Based on the video content and the script, and the man's body language, with his arms outstretched and his facial expression, suggests that he is passionately conveying his message. The fact that he is speaking in front of a screen with the chemical formulae suggests that he is likely an expert or educator in the field, which could add to his frustration if the substance he is discussing is not dissolving as expected. Overall, the emotional tone of the character seems to be one of concern or annoyance, as he is emphasizing the lack of dissolution in the substance.
Zero-shot Dubbing under GRID2Chem scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
Zero-shot Dubbing under V2C2Chem scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
In-domain Dubbing scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
V2C-Animation Sample: Bossbaby@BossBaby_00_0199_00
Text Content: And see if there's someplace around here with decent sushi.
Speaking Rate Instruction: The speaking pace of the character in the video is relatively slow and measured. There are no significant variations in the pace throughout the video. The character's mouth movements remain consistent and deliberate, indicating a steady and controlled speaking pace.
Emotion Instruction: Based on the video content and the script provided, the character appears to be in a somewhat frustrated or disappointed state. The character's facial expression and body language suggest that they are not satisfied with the current situation or the lack of sushi options available. The character's search for sushi indicates a desire for something more, possibly a craving or a preference for sushi over other food options. The emotional change in the character can be interpreted as a mix of frustration, disappointment, and perhaps even hunger or craving.
Zero-shot Dubbing under Chem2V2C scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
Zero-shot Dubbing under GRID2V2C scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
In-domain Dubbing scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
V2C-Animation Sample: Dragon@AstridHofferson_00_0365_00
Text Content: Figure out which side you're on.
Speaking Rate Instruction: The character's speaking pace in the video varies, with a faster pace during the initial part of the video and a slower pace towards the end. The trend suggests a progression from a more hurried speech to a more measured and contemplative tone as the character's thoughts and emotions evolve.
Emotion Instruction: Based on the video content and the script, the character appears to be conflicted and unsure about which side she is on. Her initial expression is one of determination and readiness for battle, suggesting that she is on one side. However, as the scene progresses and she encounters other characters, her expression changes to one of confusion and uncertainty. This could indicate that she is questioning her allegiance or trying to understand the motivations of the different sides. The emotional journey of the character seems to be one of self-discovery and figuring out where her loyalties lie.
Zero-shot Dubbing under GRID2V2C scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
Zero-shot Dubbing under Chem2V2C scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
In-domain Dubbing scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
V2C-Animation sample: Inside@Riley_00_0212_00
Text Content: Hey, I saw a pizza place down the street.
Speaking Rate Instruction: Based on the script provided and the character's mouth movements in the video, the speaking pace appears to be relatively slow and measured. The character takes their time to speak, and there are no signs of rapid or hurried speech. The trend of the speaking pace remains consistent throughout the video, with no significant variations or changes in speed.
Emotion Instruction: Based on the video content and the script provided, the character appears to be excited or intrigued by the mention of a pizza place down the street. Her eyes widen and her expression changes to one of interest or anticipation. This suggests that she may be looking forward to trying the pizza or perhaps is hungry and excited about the prospect of a delicious meal. The emotional change in the character is one of eagerness and pleasure, indicating a positive reaction to the news about the pizza place.
Zero-shot Dubbing under Chem2V2C scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
Zero-shot Dubbing under GRID2V2C scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
In-domain Dubbing scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
Chem benchmark sample: 7RrOhe6SSj0-003.wav
Text Content: three we let the reaction go.
Speaking Rate Instruction: Based on the script "three we let the reaction go," the speaking pace appears to be relatively slow and measured. The character's mouth movements are deliberate and paced, suggesting a careful and thoughtful delivery. There are no significant variations in the pace throughout the video. The character maintains a consistent speaking pace that aligns with the script's intended delivery.
Emotion Instruction: Based on the video content and the script "three we let the reaction go," the character appears to be in a state of decision-making or negotiation. The character's emotional changes could be interpreted as a shift from a more assertive or defensive stance to a more conciliatory or compromising one. The character's body language and facial expressions suggest a willingness to listen and consider alternative perspectives. This emotional change could indicate a desire to find a resolution or agreement that benefits both parties involved.
Zero-shot Dubbing under GRID2V2C scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
Zero-shot Dubbing under V2C2Chem scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)
In-domain Dubbing scenario
Ground-Truth StyleDubber Speaker2Dubber DeepDubber ProDubber InstructDubber(Ours)