Audio-only mode and Enhanced Cloning

Translate Video settings: audio-only overlay vs lip sync, Enhanced Cloning quality and pricing.

Written By Umakhan Magomedov

Last updated 3 days ago

Translate Video has two optional settings that change how the translation is processed: Audio only and Enhanced Cloning. Both are available in the settings panel before you tap Translate.

Audio only

In Audio only mode, the tool translates the spoken audio and overlays it on the original video. The video visuals are not modified and no lip sync is applied.

Use Audio only when:

The video does not show a face or the speaker is not visible
Lip sync accuracy is not important for your use case
You want a faster result at the same cost as standard mode
The video has several people speaking at the same time (lip sync would be inaccurate anyway)

ℹ️ Audio only does not reduce the token cost compared to standard mode. Both cost 3.67 tokens/second. The difference is whether lip sync is applied, not the price.

Enhanced Cloning

Enhanced Cloning uses a more accurate voice model to better match the original speaker. The dubbed voice is closer to the original in tone and character.

Use Enhanced Cloning when:

Voice authenticity matters (interviews, personal content, documentary)
The speaker has a distinctive voice you want to preserve in the translation
You are translating content where the speaker is on camera and viewers know the original voice

Mode	Cost	Lip sync
Standard	3.67 tokens/second	Yes
Standard + Audio only	3.67 tokens/second	No
Enhanced Cloning	7.34 tokens/second	Yes
Enhanced Cloning + Audio only	7.34 tokens/second	No
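
To see how the rates in the table turn into a total, here is a minimal Python sketch. It assumes the cost is simply the video duration in seconds multiplied by the per-second rate; the function name and the rounding to two decimals are illustrative assumptions, and the estimate shown in the app may round differently.

```python
# Rates taken from the table above, in tokens per second of video.
STANDARD_RATE = 3.67
ENHANCED_RATE = 7.34


def estimate_tokens(duration_seconds: float,
                    enhanced_cloning: bool = False,
                    audio_only: bool = False) -> float:
    """Rough token estimate for a translation (hypothetical helper).

    Assumes cost = duration * rate. The rounding is an assumption,
    not the app's documented behavior.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must not be negative")
    rate = ENHANCED_RATE if enhanced_cloning else STANDARD_RATE
    # audio_only is accepted for completeness: per the table it turns
    # lip sync off but does not change the per-second rate.
    return round(duration_seconds * rate, 2)


if __name__ == "__main__":
    one_minute = 60
    print(estimate_tokens(one_minute))                         # standard
    print(estimate_tokens(one_minute, audio_only=True))        # same rate, no lip sync
    print(estimate_tokens(one_minute, enhanced_cloning=True))  # double the rate
```

Under these assumptions a one-minute video comes to about 220.2 tokens in standard mode and about 440.4 tokens with Enhanced Cloning, with or without Audio only.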

How to enable

Open Translate Video from the Tools tab.
Choose your file or paste a link.
Tap the Settings (gear) icon in the top right.
Toggle Audio only or Enhanced Cloning.
Tap Translate. The updated token estimate reflects your choice.


Frequently asked questions

Can I use both Audio only and Enhanced Cloning together?

Does Enhanced Cloning work with all languages?

The lip sync looks off. Should I use Audio only instead?

Does the mode affect processing time?
