Translate Audio settings

Speech recognition, translation model and voiceover providers in Translate Audio: options, pricing and behavior.

Written By Umakhan Magomedov

Last updated 4 days ago

Open the Settings sheet in Translate Audio to control speech recognition, translation quality and voiceover. This article explains every option and when it applies.

Where to find settings

  1. Open Translate Audio from the Tools tab.

  2. Tap the Settings icon in the top right corner.

  3. Change recognition, translation or voiceover options. Token estimates update immediately.

ℹ️ Speech recognition and translation model changes apply on the next file upload, not to the current result. Voiceover settings take effect the next time you generate audio.


Recognition (speech-to-text)

Choose which engine transcribes the uploaded audio. The default is ElevenLabs Scribe.

| Provider | Cost | Notes |
| --- | --- | --- |
| ElevenLabs Scribe (default) | 0.0133 tokens/sec | Recommended. Fast and accurate for most recordings. |
| OpenAI Transcribe | 0.02 tokens/sec | gpt-4o-transcribe model. Good for noisy audio. |
| Whisper | 0.01 tokens/sec | Budget option. Slightly slower on long files. |
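
As a rough illustration of how the per-second rates translate into a token estimate, the sketch below multiplies each rate by the audio duration. The rates are copied from this article; the function name and the assumption that billing is strictly proportional to duration (no rounding, no minimum) are hypothetical and may not match how the app computes its estimate.

```python
# Rates copied from the recognition table, in tokens per second of audio.
RECOGNITION_RATES = {
    "ElevenLabs Scribe": 0.0133,
    "OpenAI Transcribe": 0.02,
    "Whisper": 0.01,
}


def estimate_recognition_tokens(provider: str, duration_seconds: float) -> float:
    """Estimate transcription cost as rate x duration.

    Assumption: cost scales linearly with duration, with no rounding or minimum.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return RECOGNITION_RATES[provider] * duration_seconds


if __name__ == "__main__":
    # Compare all three engines for a 10-minute (600-second) recording.
    for name in RECOGNITION_RATES:
        tokens = estimate_recognition_tokens(name, 600)
        print(f"{name}: {tokens:.2f} tokens")
```

Under these assumptions, a 10-minute recording costs roughly 8 tokens with the default engine and 6 tokens with Whisper, so the price gap between engines only becomes noticeable on long files.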


Translation

Pick the AI model for re-translations when you change the target language or edit the source text.

⚠️ The automatic pipeline on first upload always uses Gemini 3 on the backend, regardless of the model selected here. Settings only affect re-translations.

| Model | Cost | Best for |
| --- | --- | --- |
| Gemini Flash Lite | 0.006 tokens/1K chars | Fastest, lowest cost re-translations |
| Gemini Flash | 0.028 tokens/1K chars | Balanced speed and quality |
| Gemini 3 (default) | 0.044 tokens/1K chars | First pipeline translation and high-quality re-translations |
| GPT-4o | 0.156 tokens/1K chars | Maximum accuracy for complex text |
| GPT-5 Mini | 0.028 tokens/1K chars | Good quality at moderate cost |
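
The rule that the first upload ignores the selected model can be sketched as below. The rates are copied from this article; the function names, the proportional billing (no rounding up to a whole 1K characters), and the idea that the first pipeline translation is billed at the Gemini 3 rate are all assumptions made for illustration.

```python
# Rates copied from the translation table, in tokens per 1,000 characters.
TRANSLATION_RATES = {
    "Gemini Flash Lite": 0.006,
    "Gemini Flash": 0.028,
    "Gemini 3": 0.044,
    "GPT-4o": 0.156,
    "GPT-5 Mini": 0.028,
}

# The article states the automatic pipeline always uses Gemini 3 on first upload.
PIPELINE_MODEL = "Gemini 3"


def effective_model(selected: str, is_first_upload: bool) -> str:
    """Return the model that actually runs: the setting only affects re-translations."""
    return PIPELINE_MODEL if is_first_upload else selected


def estimate_translation_tokens(selected: str, char_count: int,
                                is_first_upload: bool = False) -> float:
    """Estimate cost as rate x (characters / 1000). Linear billing is an assumption."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    model = effective_model(selected, is_first_upload)
    return TRANSLATION_RATES[model] * char_count / 1000


if __name__ == "__main__":
    # Re-translating 5,000 characters with GPT-4o selected.
    print(f"{estimate_translation_tokens('GPT-4o', 5000):.3f}")
    # Same text on first upload: the selection is ignored.
    print(effective_model("GPT-4o", is_first_upload=True))
```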


Voiceover without cloning

Standard synthetic voices. No voice sample from the original audio is used.

| Provider | Languages | Cost | Speed |
| --- | --- | --- | --- |
| ElevenLabs (default) | ~74 languages | 0.01 tokens/sec | ~2 seconds |
| OpenAI | Wide support | 0.03 tokens/sec | ~5 seconds |

If ElevenLabs does not support your target language, the app falls back to OpenAI automatically.
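
The automatic fallback can be pictured with a small sketch. The language set below is a hypothetical placeholder, since this article does not list the ~74 languages ElevenLabs supports, and the function name is invented for illustration.

```python
# Hypothetical subset for illustration only; the real supported list is not
# given in this article.
ELEVENLABS_LANGUAGES = {"English", "Spanish", "German", "French", "Japanese"}


def pick_standard_voice_provider(target_language: str,
                                 preferred: str = "ElevenLabs") -> str:
    """Return the provider that would generate a non-cloned voiceover.

    Mirrors the rule in the text: if ElevenLabs is preferred but does not
    support the target language, fall back to OpenAI.
    """
    if preferred == "ElevenLabs" and target_language not in ELEVENLABS_LANGUAGES:
        return "OpenAI"
    return preferred


if __name__ == "__main__":
    print(pick_standard_voice_provider("Spanish"))
    # "Klingon" is absent from the placeholder set, so this falls back.
    print(pick_standard_voice_provider("Klingon"))
```

Note that a fallback to OpenAI also changes the rate from 0.01 to 0.03 tokens/sec, so the estimate for the same text can triple.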


Voiceover with cloning

These providers clone the speaker's voice from your uploaded audio or a saved Custom Voice.

MiniMax

  • Cost: 0.15 tokens/sec + 150 tokens first-time clone per voice

  • Speed control: 0.5x to 2.0x

  • Emotions: 7 presets + Auto

  • Min audio for clone: 10 seconds

  • Saved Custom Voice: Yes, via Custom Voices

Qwen

  • Languages: 10 (Russian, English, Chinese, German, French, Spanish, Italian, Japanese, Korean, Portuguese)

  • Cost: 0.15 tokens/sec, minimum 5 tokens per request

  • Min audio for clone: 3 seconds

  • Style presets: Available in auto_clone mode only, not with saved Custom Voices

HeyGen

  • Cost: 1.84 tokens/sec (HeyGen v3 since June 3, 2026)

  • Generation time: ~10 minutes for long text

  • Output format: Audio MP4

  • Saved Custom Voices: Not supported. Clones from uploaded audio only.
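
To compare the three cloning price structures side by side, the sketch below encodes each one as a function. The numbers come from this article; the function names are hypothetical. Reading "minimum 5 tokens per request" as a floor on the per-request charge, and billing by seconds of generated audio with no rounding, are both interpretations and may not match the actual billing logic.

```python
def first_time_fee_provider_tokens(duration_seconds: float,
                                   first_time_clone: bool) -> float:
    """Provider listed first: 0.15 tokens/sec plus 150 tokens the first time a voice is cloned."""
    clone_fee = 150.0 if first_time_clone else 0.0
    return 0.15 * duration_seconds + clone_fee


def qwen_tokens(duration_seconds: float) -> float:
    """Qwen: 0.15 tokens/sec, interpreted here with a 5-token floor per request."""
    return max(0.15 * duration_seconds, 5.0)


def heygen_tokens(duration_seconds: float) -> float:
    """HeyGen: flat 1.84 tokens/sec."""
    return 1.84 * duration_seconds


if __name__ == "__main__":
    seconds = 20  # a short 20-second voiceover
    print(f"first clone: {first_time_fee_provider_tokens(seconds, True):.2f}")
    print(f"repeat use:  {first_time_fee_provider_tokens(seconds, False):.2f}")
    print(f"Qwen:        {qwen_tokens(seconds):.2f}")
    print(f"HeyGen:      {heygen_tokens(seconds):.2f}")
```

Under these assumptions, the one-time clone fee dominates short clips for the first provider, the Qwen floor applies to anything shorter than about 33 seconds, and HeyGen is the most expensive per second by a wide margin.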


TTS behavior

  • Edit translation: changing the translated text clears the current voiceover. Tap play to regenerate.

  • Pending or completed jobs: MiniMax, Qwen and HeyGen jobs continue in the background. Reopening from History resumes playback or polling.

  • Language change: if the current cloning provider does not support the new language or the audio is too short, the app auto-switches to ElevenLabs.

  • Settings change: switching provider, speed, emotion or style clears cached audio for the current result.
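
The rules above amount to cache invalidation plus a fallback check. The sketch below is one way to model them, not the app's actual implementation: the class, field and function names are hypothetical, only provider and speed are tracked to keep it short, and the Qwen language list and 3-second minimum are copied from this article.

```python
from dataclasses import dataclass
from typing import Optional

# Values copied from the Qwen entry in this article.
QWEN_LANGUAGES = {
    "Russian", "English", "Chinese", "German", "French",
    "Spanish", "Italian", "Japanese", "Korean", "Portuguese",
}
QWEN_MIN_CLONE_SECONDS = 3.0
FALLBACK_PROVIDER = "ElevenLabs"


def provider_after_language_change(target_language: str,
                                   source_audio_seconds: float) -> str:
    """Keep Qwen if it supports the language and the audio is long enough."""
    if target_language not in QWEN_LANGUAGES:
        return FALLBACK_PROVIDER
    if source_audio_seconds < QWEN_MIN_CLONE_SECONDS:
        return FALLBACK_PROVIDER
    return "Qwen"


@dataclass
class VoiceoverResult:
    """Hypothetical model of one result and its cached voiceover."""
    translated_text: str
    provider: str
    speed: float = 1.0
    cached_audio: Optional[bytes] = None

    def edit_translation(self, new_text: str) -> None:
        # Changing the translated text clears the current voiceover.
        if new_text != self.translated_text:
            self.translated_text = new_text
            self.cached_audio = None

    def change_settings(self, provider: str, speed: float) -> None:
        # Switching provider or speed clears cached audio for this result.
        if (provider, speed) != (self.provider, self.speed):
            self.provider, self.speed = provider, speed
            self.cached_audio = None


if __name__ == "__main__":
    result = VoiceoverResult("Hola", "Qwen", cached_audio=b"audio-bytes")
    result.change_settings(provider_after_language_change("Hindi", 12.0), 1.0)
    print(result.provider, result.cached_audio)
```

In this model an unchanged setting leaves the cache intact, which is why reopening a result without touching anything can resume playback instead of regenerating audio.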


Frequently asked questions