The main architecture of the proposed InstructDubber. To predict lip-synchronized phoneme-level durations, the Instructed Duration Distilling module (IDD) mines duration cues from fine-grained speaking rate instructions. The Instructed Emotion Calibrating module (IEC) fine-tunes a lightweight LLM to analyze the emotion instructions, using the emotion entities from the ground-truth dubbing as supervision, and predicts the dubbing prosody based on the calibrated emotion analysis.

An example of each benchmark.

| V2C-Animation Benchmark | Chem Benchmark | GRID Benchmark |
|---|---|---|
| V2C-Animation is derived from real animated films and is characterized by exaggerated prosodic variations and complex visual scenes, making it the most challenging dubbing benchmark. | The Chem dataset is collected from chemistry lecture videos on YouTube, featuring a fixed camera perspective and moderate variations in speaking rate and emotional expression. | GRID is a basic and widely used multi-speaker dubbing benchmark, recorded in a noise-free studio against a uniform screen background, with a stable speaking rate and minimal emotional variation. |
Due to space limitations and our primary focus on proposing a novel approach for dubbing alignment, the main text does not compare voice cloning performance across different models. Here, we compare the voice cloning performance of our approach with that of the state-of-the-art dubbing model ProDubber, which also serves as our baseline. We extract speaker embeddings from both the generated dubbing and the ground-truth audio using a GE2E-based speaker encoder and compute the cosine similarity between them. The results are presented in Table S3 of the Appendix. As shown in the table, our method achieves speaker similarity comparable to ProDubber in the in-domain dubbing scenario. Moreover, in the zero-shot dubbing scenarios, our approach demonstrates superior voice cloning performance, which we attribute to its enhanced robustness and generalization capability across different visual domains.
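The speaker similarity measurement described above can be illustrated with a minimal sketch. It assumes the speaker embeddings have already been extracted as plain lists of floats; the GE2E-based encoder itself is not invoked here, and the function names `cosine_similarity` and `mean_speaker_similarity` are hypothetical, not taken from the InstructDubber code.

```python
import math
from typing import Iterable, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def mean_speaker_similarity(
    pairs: Iterable[Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Average similarity over (generated, ground-truth) embedding pairs."""
    scores = [cosine_similarity(gen, ref) for gen, ref in pairs]
    if not scores:
        raise ValueError("at least one embedding pair is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real speaker embeddings.
    generated = [0.2, 0.5, 0.1]
    ground_truth = [0.25, 0.45, 0.05]
    print(cosine_similarity(generated, ground_truth))
```

A higher average indicates that the generated dubbing is closer to the reference speaker's timbre. How the scores are aggregated per benchmark or scenario in Table S3 is not specified here, so the simple mean is an assumption.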
| Chem Sample: 0G0wCm28Jzc-005 |
|---|
| Text Content: It doesn't dissolve very much. |
| Speaking Rate Instruction: The character's speaking pace in the video is relatively slow and measured, with a clear and deliberate delivery. There are no significant variations in the pace throughout the video, and the character maintains a consistent speaking rate throughout the script. The character's mouth movements are also consistent, with minimal variations in the shape and position of the mouth, which further supports the slow and measured pace of the speech. |
| Emotion Instruction: Based on the video content and the script, and the man's body language, with his arms outstretched and his facial expression, suggests that he is passionately conveying his message. The fact that he is speaking in front of a screen with the chemical formulae suggests that he is likely an expert or educator in the field, which could add to his frustration if the substance he is discussing is not dissolving as expected. Overall, the emotional tone of the character seems to be one of concern or annoyance, as he is emphasizing the lack of dissolution in the substance. |

| Scenario | Ground-Truth | StyleDubber | Speaker2Dubber | DeepDubber | ProDubber | InstructDubber (Ours) |
|---|---|---|---|---|---|---|
| Zero-shot Dubbing under GRID2Chem scenario | | | | | | |
| Zero-shot Dubbing under V2C2Chem scenario | | | | | | |
| In-domain Dubbing scenario | | | | | | |
| V2C-Animation Sample: Bossbaby@BossBaby_00_0199_00 |
|---|
| Text Content: And see if there's someplace around here with decent sushi. |
| Speaking Rate Instruction: The speaking pace of the character in the video is relatively slow and measured. There are no significant variations in the pace throughout the video. The character's mouth movements remain consistent and deliberate, indicating a steady and controlled speaking pace. |
| Emotion Instruction: Based on the video content and the script provided, the character appears to be in a somewhat frustrated or disappointed state. The character's facial expression and body language suggest that they are not satisfied with the current situation or the lack of sushi options available. The character's search for sushi indicates a desire for something more, possibly a craving or a preference for sushi over other food options. The emotional change in the character can be interpreted as a mix of frustration, disappointment, and perhaps even hunger or craving. |

| Scenario | Ground-Truth | StyleDubber | Speaker2Dubber | DeepDubber | ProDubber | InstructDubber (Ours) |
|---|---|---|---|---|---|---|
| Zero-shot Dubbing under Chem2V2C scenario | | | | | | |
| Zero-shot Dubbing under GRID2V2C scenario | | | | | | |
| In-domain Dubbing scenario | | | | | | |
| V2C-Animation Sample: Dragon@AstridHofferson_00_0365_00 |
|---|
| Text Content: Figure out which side you're on. |
| Speaking Rate Instruction: The character's speaking pace in the video varies, with a faster pace during the initial part of the video and a slower pace towards the end. The trend suggests a progression from a more hurried speech to a more measured and contemplative tone as the character's thoughts and emotions evolve. |
| Emotion Instruction: Based on the video content and the script, the character appears to be conflicted and unsure about which side she is on. Her initial expression is one of determination and readiness for battle, suggesting that she is on one side. However, as the scene progresses and she encounters other characters, her expression changes to one of confusion and uncertainty. This could indicate that she is questioning her allegiance or trying to understand the motivations of the different sides. The emotional journey of the character seems to be one of self-discovery and figuring out where her loyalties lie. |

| Scenario | Ground-Truth | StyleDubber | Speaker2Dubber | DeepDubber | ProDubber | InstructDubber (Ours) |
|---|---|---|---|---|---|---|
| Zero-shot Dubbing under GRID2V2C scenario | | | | | | |
| Zero-shot Dubbing under Chem2V2C scenario | | | | | | |
| In-domain Dubbing scenario | | | | | | |
| V2C-Animation Sample: Inside@Riley_00_0212_00 |
|---|
| Text Content: Hey, i saw a pizza place down the street. |
| Speaking Rate Instruction: Based on the script provided and the character's mouth movements in the video, the speaking pace appears to be relatively slow and measured. The character takes their time to speak, and there are no signs of rapid or hurried speech. The trend of the speaking pace remains consistent throughout the video, with no significant variations or changes in speed. |
| Emotion Instruction: Based on the video content and the script provided, the character appears to be excited or intrigued by the mention of a pizza place down the street. Her eyes widen and her expression changes to one of interest or anticipation. This suggests that she may be looking forward to trying the pizza or perhaps is hungry and excited about the prospect of a delicious meal. The emotional change in the character is one of eagerness and pleasure, indicating a positive reaction to the news about the pizza place. |

| Scenario | Ground-Truth | StyleDubber | Speaker2Dubber | DeepDubber | ProDubber | InstructDubber (Ours) |
|---|---|---|---|---|---|---|
| Zero-shot Dubbing under Chem2V2C scenario | | | | | | |
| Zero-shot Dubbing under GRID2V2C scenario | | | | | | |
| In-domain Dubbing scenario | | | | | | |
| Chem Benchmark Sample: 7RrOhe6SSj0-003.wav |
|---|
| Text Content: three we let the reaction go. |
| Speaking Rate Instruction: Based on the script "three we let the reaction go," the speaking pace appears to be relatively slow and measured. The character's mouth movements are deliberate and paced, suggesting a careful and thoughtful delivery. There are no significant variations in the pace throughout the video. The character maintains a consistent speaking pace that aligns with the script's intended delivery. |
| Emotion Instruction: Based on the video content and the script "three we let the reaction go," the character appears to be in a state of decision-making or negotiation. The character's emotional changes could be interpreted as a shift from a more assertive or defensive stance to a more conciliatory or compromising one. The character's body language and facial expressions suggest a willingness to listen and consider alternative perspectives. This emotional change could indicate a desire to find a resolution or agreement that benefits both parties involved. |

| Scenario | Ground-Truth | StyleDubber | Speaker2Dubber | DeepDubber | ProDubber | InstructDubber (Ours) |
|---|---|---|---|---|---|---|
| Zero-shot Dubbing under GRID2V2C scenario | | | | | | |
| Zero-shot Dubbing under V2C2Chem scenario | | | | | | |
| In-domain Dubbing scenario | | | | | | |