Translate Audio settings

Speech recognition, translation model and voiceover providers in Translate Audio: options, pricing and behavior.

Written By Umakhan Magomedov

Last updated 4 days ago

Open the Settings sheet in Translate Audio to control speech recognition, translation quality and voiceover. This article explains every option and when it applies.

Where to find settings

1. Open Translate Audio from the Tools tab.
2. Tap the Settings icon in the top right corner.
3. Change recognition, translation or voiceover options. Token estimates update immediately.

ℹ️ Changes to speech recognition and the translation model apply to the next file you upload, not to the current result. Voiceover settings apply the next time you generate audio.

Recognition (speech-to-text)

Choose which engine transcribes the uploaded audio. The default is ElevenLabs Scribe.

Provider	Cost	Notes
ElevenLabs Scribe (default)	0.0133 tokens/sec	Recommended. Fast and accurate for most recordings.
OpenAI Transcribe	0.02 tokens/sec	gpt-4o-transcribe model. Good for noisy audio.
Whisper	0.01 tokens/sec	Budget option. Slightly slower on long files.
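
As a rough illustration of how the per-second rates above add up, the sketch below multiplies each rate by the audio duration. It assumes billing is strictly linear in audio length with no rounding or minimum charge, which this article does not confirm. The function name `recognition_cost` is hypothetical and not part of the app.

```python
# Rates copied from the table above, in tokens per second of audio.
RECOGNITION_RATES = {
    "ElevenLabs Scribe": 0.0133,
    "OpenAI Transcribe": 0.02,
    "Whisper": 0.01,
}


def recognition_cost(provider: str, duration_sec: float) -> float:
    """Estimate tokens for transcribing `duration_sec` seconds of audio.

    Assumption: cost = rate * duration, with no rounding or minimum.
    """
    return RECOGNITION_RATES[provider] * duration_sec


# Compare all three providers for a 10-minute (600-second) recording.
for name in RECOGNITION_RATES:
    print(f"{name}: {recognition_cost(name, 600):.2f} tokens")
```

Under these assumptions, the gap between providers stays small in absolute terms even for long recordings, so accuracy on your kind of audio is usually the more important factor.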

Translation

Pick the AI model used for re-translation, which runs when you change the target language or edit the source text.

⚠️ The automatic pipeline on first upload always uses Gemini 3 on the backend, regardless of the model selected here. Settings only affect re-translations.

Model	Cost	Best for
Gemini Flash Lite	0.006 tokens/1K chars	Fastest, lowest cost re-translations
Gemini Flash	0.028 tokens/1K chars	Balanced speed and quality
Gemini 3 (default)	0.044 tokens/1K chars	First pipeline translation and high-quality re-translations
GPT-4o	0.156 tokens/1K chars	Maximum accuracy for complex text
GPT-5 Mini	0.028 tokens/1K chars	Good quality at moderate cost
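
The sketch below shows one way to read the table together with the warning above: the selected model only matters for re-translations, while the first upload is treated as Gemini 3. It assumes cost scales linearly with character count and that the first pipeline translation is billed at the Gemini 3 rate. Both are assumptions, and `translation_cost` is a hypothetical helper, not an app API.

```python
# Rates copied from the table above, in tokens per 1,000 characters.
TRANSLATION_RATES = {
    "Gemini Flash Lite": 0.006,
    "Gemini Flash": 0.028,
    "Gemini 3": 0.044,
    "GPT-4o": 0.156,
    "GPT-5 Mini": 0.028,
}


def translation_cost(selected_model: str, char_count: int,
                     first_upload: bool = False) -> float:
    """Estimate tokens for translating `char_count` characters.

    On first upload the pipeline always uses Gemini 3, so the selected
    model is ignored. Assumption: that run is billed at the Gemini 3 rate.
    """
    model = "Gemini 3" if first_upload else selected_model
    return TRANSLATION_RATES[model] * char_count / 1000


# Re-translating 5,000 characters with GPT-4o selected.
print(f"{translation_cost('GPT-4o', 5000):.2f} tokens")
# First upload of the same text: GPT-4o is selected but Gemini 3 runs.
print(f"{translation_cost('GPT-4o', 5000, first_upload=True):.2f} tokens")
```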

Voiceover without cloning

Standard synthetic voices. No voice sample from the original audio is used.

Provider	Languages	Cost	Speed
ElevenLabs (default)	~74 languages	0.01 tokens/sec	~2 seconds
OpenAI	Wide support	0.03 tokens/sec	~5 seconds

If ElevenLabs does not support your target language, the app falls back to OpenAI automatically.
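
The fallback rule can be sketched as a simple lookup. The language set below is a made-up placeholder, not the real list of roughly 74 languages, and the function names are hypothetical. The cost helper assumes the per-second rate applies to the length of the generated audio.

```python
# Rates copied from the table above, in tokens per second.
STANDARD_RATES = {"ElevenLabs": 0.01, "OpenAI": 0.03}


def pick_standard_provider(target_lang: str, elevenlabs_langs: set) -> str:
    """Use ElevenLabs when it supports the language, else fall back to OpenAI."""
    return "ElevenLabs" if target_lang in elevenlabs_langs else "OpenAI"


def standard_voiceover_cost(provider: str, duration_sec: float) -> float:
    """Assumption: cost = rate * seconds of generated audio."""
    return STANDARD_RATES[provider] * duration_sec


# Placeholder set for illustration only; the real supported list is longer.
sample_supported = {"en", "es", "de"}

provider = pick_standard_provider("xx", sample_supported)
print(provider, f"{standard_voiceover_cost(provider, 120):.2f} tokens")
```

Note that a fallback to OpenAI also means the higher per-second rate applies to that voiceover.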

Voiceover with cloning

These providers clone the speaker's voice from your uploaded audio or from a saved Custom Voice.

MiniMax (recommended)

Cost	0.15 tokens/sec, plus a one-time 150 tokens the first time each voice is cloned
Speed control	0.5x to 2.0x
Emotions	7 presets + Auto
Min audio for clone	10 seconds
Saved Custom Voice	Yes, via Custom Voices
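
To make the one-time clone fee concrete, here is a minimal sketch of a MiniMax estimate. It assumes the 150-token fee is added once per new voice on top of the per-second charge, and that the per-second rate applies to the generated audio length. `minimax_cost` is a hypothetical helper.

```python
MINIMAX_RATE = 0.15       # tokens per second, from the table above
MINIMAX_CLONE_FEE = 150   # one-time tokens per newly cloned voice


def minimax_cost(duration_sec: float, voice_already_cloned: bool) -> float:
    """Estimate tokens for a MiniMax voiceover of `duration_sec` seconds.

    Assumption: the clone fee is charged only when the voice has not
    been cloned before.
    """
    fee = 0 if voice_already_cloned else MINIMAX_CLONE_FEE
    return MINIMAX_RATE * duration_sec + fee


# A 60-second voiceover with a brand-new voice, then with the same voice again.
print(f"first use: {minimax_cost(60, voice_already_cloned=False):.2f} tokens")
print(f"later use: {minimax_cost(60, voice_already_cloned=True):.2f} tokens")
```

For short clips the clone fee dominates the first request, so reusing a saved Custom Voice is what keeps later requests cheap.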

Qwen

Languages	10: Russian, English, Chinese, German, French, Spanish, Italian, Japanese, Korean, Portuguese
Cost	0.15 tokens/sec, minimum 5 tokens per request
Min audio for clone	3 seconds
Style presets	Available in auto_clone mode only, not with saved Custom Voices
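
The Qwen limits above can be combined into two small checks. This sketch reads "minimum 5 tokens per request" as a floor on the per-second charge, which is an interpretation rather than a confirmed billing rule. The function names are hypothetical.

```python
# Values copied from the table above.
QWEN_LANGUAGES = {
    "Russian", "English", "Chinese", "German", "French",
    "Spanish", "Italian", "Japanese", "Korean", "Portuguese",
}
QWEN_RATE = 0.15          # tokens per second
QWEN_MIN_TOKENS = 5       # minimum charge per request
QWEN_MIN_CLONE_SEC = 3    # minimum source audio for cloning


def qwen_cost(duration_sec: float) -> float:
    """Per-second charge, but never less than the per-request minimum."""
    return max(QWEN_RATE * duration_sec, QWEN_MIN_TOKENS)


def qwen_can_clone(target_lang: str, sample_sec: float) -> bool:
    """True when the language is supported and the sample is long enough."""
    return target_lang in QWEN_LANGUAGES and sample_sec >= QWEN_MIN_CLONE_SEC


print(f"{qwen_cost(10):.2f} tokens")   # short clip, the minimum applies
print(f"{qwen_cost(60):.2f} tokens")   # longer clip, billed per second
print(qwen_can_clone("German", 4), qwen_can_clone("Polish", 30))
```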

HeyGen

Cost	1.84 tokens/sec (HeyGen v3 since June 3, 2026)
Generation time	~10 minutes for long text
Output format	Audio MP4
Saved Custom Voices	Not supported. Clones from uploaded audio only.

TTS behavior

Edit translation: changing the translated text clears the current voiceover. Tap play to regenerate.
Pending or completed jobs: MiniMax, Qwen and HeyGen jobs continue in the background. Reopening from History resumes playback or polling.
Language change: if the current cloning provider does not support the new language or the audio is too short, the app auto-switches to ElevenLabs.
Settings change: switching provider, speed, emotion or style clears cached audio for the current result.
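
The rules above amount to a small piece of state handling. The sketch below is an illustrative model only, not the app's actual code: the `Result` class and all function names are hypothetical, and cached audio is represented by a plain string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    translation: str
    provider: str
    cached_audio: Optional[str] = None  # stand-in for generated audio


def edit_translation(result: Result, new_text: str) -> None:
    """Editing the translated text clears the current voiceover."""
    result.translation = new_text
    result.cached_audio = None


def change_provider(result: Result, provider: str) -> None:
    """Switching provider (or speed, emotion, style) clears cached audio."""
    result.provider = provider
    result.cached_audio = None


def change_language(result: Result, provider_supports_lang: bool,
                    sample_long_enough: bool) -> None:
    """Auto-switch to ElevenLabs when the cloning provider cannot be used.

    The article does not say what happens to cached audio here, so this
    sketch leaves it untouched.
    """
    if not (provider_supports_lang and sample_long_enough):
        result.provider = "ElevenLabs"


r = Result(translation="Hola", provider="Qwen", cached_audio="audio-1")
edit_translation(r, "Hola mundo")
print(r.cached_audio)   # None: tap play to regenerate
change_language(r, provider_supports_lang=False, sample_long_enough=True)
print(r.provider)       # ElevenLabs
```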

Frequently asked questions

What does MiniMax speed control do?

Can I use Qwen style presets with a saved Custom Voice?

HeyGen or MiniMax: which should I pick?

Why did my voiceover disappear after I edited the translation?

My audio is too short for voice cloning. What is the minimum?

ElevenLabs or OpenAI for standard voiceover?

When are the 150 tokens charged for MiniMax?

Can I use a HeyGen Custom Voice from my account?
