How Translate Audio works

Learn how Translate Audio recognizes speech, translates it, and generates voiced audio in your chosen language.

Written By Umakhan Magomedov

Last updated 4 days ago

Translate Audio takes audio from your device, a link, or a messenger app, recognizes the speech, translates it, and generates voiced audio in the target language.

Translate Audio file selection screen

When to use

  • Understand a voice message from Telegram, WhatsApp or another messenger

  • Get a translation of a podcast, interview or audio recording

  • Transcribe and translate a lecture or meeting recording

  • Listen to the translated content as audio instead of only reading it


What you can upload

Formats: MP3, M4A, WAV, OGG, FLAC, AAC, OPUS, WebM. Video files are also accepted: the audio track is extracted automatically.

Max file size: 10 MB

Sources:

  • File from your device or gallery

  • Audio shared from a messenger: Telegram, WhatsApp, Viber, iMessage

  • Link to YouTube, Instagram or TikTok, or a direct audio/video URL
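As an illustration of the limits above, here is a minimal Python sketch of a pre-upload check you could run on your own files. It is not part of the app. The function names, the extension spellings, and the reading of 10 MB as 10 × 1024 × 1024 bytes are assumptions made for this example. Video extensions are not covered because the article does not list them.

```python
from pathlib import Path

# Audio formats listed in this article. The exact extension spellings
# are an assumption; video formats are not listed, so they are not checked.
LISTED_AUDIO_EXTENSIONS = {
    ".mp3", ".m4a", ".wav", ".ogg", ".flac", ".aac", ".opus", ".webm",
}

# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024


def is_listed_audio_format(filename: str) -> bool:
    """Return True if the file extension is one of the listed audio formats."""
    return Path(filename).suffix.lower() in LISTED_AUDIO_EXTENSIONS


def is_within_size_limit(size_bytes: int) -> bool:
    """Return True if the file is no larger than the assumed 10 MB limit."""
    return 0 <= size_bytes <= MAX_FILE_SIZE_BYTES


if __name__ == "__main__":
    # Hypothetical file name and size, for illustration only.
    name, size = "voice_message.ogg", 3_500_000
    print(name, is_listed_audio_format(name), is_within_size_limit(size))
```

A file that fails the format check may still be accepted if it is a video, since the audio track is extracted automatically.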


How to run

  1. Open the Translate Audio tool from the Tools tab.

  2. Choose a source: tap Choose to pick a file from your device, tap Paste link to import from a URL, or share audio directly from a messenger app.

  3. Recognition starts automatically. The source text appears in the Source text tab as the audio is processed.

  4. The app switches to the Translation tab once recognition is done. Select the target language at the top if needed.

  5. The translated text and a voiced audio file appear in the Translation tab. Tap play to listen.

ℹ️ To share a voice message from Telegram or WhatsApp, open the message, tap Share and select VocaLingo. See How to share audio from messengers for step-by-step instructions per app.

Translate Audio in action: from file selection to translation result

What you get

After processing, results are organized into three tabs:

  • Source text: the original speech transcribed as text, with a mini audio player for the uploaded file.

  • Translation: the translated text and a playable audio file in the target language. You can edit the translation and regenerate the audio if needed.

  • Summary: a short summary of the translated content, generated on demand.

Source text tab with recognized speech

Translation tab with translated text and audio player

Results are saved to History automatically after translation. You can reopen any past result from the history icon without uploading the file again.


How much it costs

The tool uses tokens for three steps: speech recognition, text translation, and audio generation. The cost of audio generation depends on the voice provider selected in Settings. For the full pricing breakdown, see Token pricing for each tool.


Voice provider settings

Tap the settings icon in the top right corner to choose how the translated audio is generated:

Voice provider settings sheet

Provider                  Voice type                   Generation time   Price
Default (OpenAI)          Standard synthetic voice     ~5 seconds        0.03 tokens/sec
Voice cloning (MiniMax)   Cloned from original audio   ~60 seconds       150 tokens fixed + 0.15 tokens/sec
Voice cloning (Heygen)    Cloned, highest quality      ~10 minutes       5 tokens/sec

MiniMax and Heygen clone the voice from the original audio, so the translation sounds closer to the original speaker. Heygen generally produces the most natural result but takes longer and costs more.
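To compare providers for a given recording, the audio generation prices listed for each provider can be turned into a small calculation. The Python sketch below is illustrative only. It assumes the per-second rate applies to the length of the audio in seconds, which the article does not state explicitly. It covers audio generation only, not speech recognition or text translation. The provider keys and function name are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderRate:
    fixed_tokens: float       # one-time charge per generation
    tokens_per_second: float  # charge per second of audio (assumed basis)


# Rates copied from the voice provider prices in this article.
# The dictionary keys are hypothetical labels, not app identifiers.
RATES = {
    "openai": ProviderRate(fixed_tokens=0.0, tokens_per_second=0.03),
    "minimax": ProviderRate(fixed_tokens=150.0, tokens_per_second=0.15),
    "heygen": ProviderRate(fixed_tokens=0.0, tokens_per_second=5.0),
}


def audio_generation_cost(provider: str, audio_seconds: float) -> float:
    """Estimate the token cost of the audio generation step only."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must not be negative")
    rate = RATES[provider]
    return rate.fixed_tokens + rate.tokens_per_second * audio_seconds


if __name__ == "__main__":
    # Compare all three providers for a one-minute recording.
    for name in RATES:
        print(f"{name}: {audio_generation_cost(name, 60):.2f} tokens")
```

Under these assumptions, a 60-second recording costs about 1.8 tokens with the default voice, about 159 tokens with MiniMax, and about 300 tokens with Heygen. Because MiniMax has a fixed charge, it only becomes cheaper than Heygen once the audio runs longer than roughly 31 seconds.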

