How Video to Text works

Upload a video or paste a link to get a full transcript, speaker labels, detected language and a structured summary.

Written By Umakhan Magomedov

Last updated 3 days ago

Video to Text (video transcription) turns speech in a video into a full text transcript with automatic language detection, speaker labels and an optional structured summary. Upload a file from your device or paste a link from YouTube, Instagram, TikTok or a direct video URL.

When to use

Get a transcript of a video lesson, webinar or recorded presentation
Extract spoken content from a YouTube video without watching it manually
Turn a podcast-style video into searchable text for notes or translation
Prepare text from a video for editing, analysis or sharing

What you can upload

Formats: MP4, MOV, AVI, MKV, WebM, M4V

Source	Max size	Notes
File from device or gallery	Large files OK	Compressed before upload when needed
Link (YouTube, Instagram, TikTok, direct URL)	2 GB	Downloaded on the server, no manual download needed

Link sources: YouTube, Instagram, TikTok or a direct video URL.
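For readers who prepare files in bulk, the limits above can be pre-checked before submitting anything. The following is only a minimal sketch, not VocaLingo's own code: the function names `is_supported_format` and `link_size_ok` are hypothetical, and reading "2 GB" as 2 × 1024³ bytes is an assumption, since the article does not say whether the limit is binary or decimal.

```python
from pathlib import PurePath

# Formats listed in this article.
SUPPORTED_FORMATS = {"mp4", "mov", "avi", "mkv", "webm", "m4v"}

# Assumption: "2 GB" is interpreted as 2 * 1024**3 bytes.
MAX_LINK_BYTES = 2 * 1024**3


def is_supported_format(filename: str) -> bool:
    """Return True if the file extension is one of the listed formats."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    return ext in SUPPORTED_FORMATS


def link_size_ok(size_bytes: int) -> bool:
    """Return True if a link import fits within the 2 GB limit."""
    return 0 < size_bytes <= MAX_LINK_BYTES
```

The size check applies only to link imports; the article sets no explicit limit for device uploads, so none is modeled here.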

Importing from a link

Tap Paste link, enter the URL and tap Download. The app shows three phases:

Downloading: VocaLingo fetches the video from the platform on the server.
Saving: the file is stored securely for processing.
Processing: the video is prepared and speech recognition starts automatically.

Link imports of up to 2 GB are downloaded on the server, so you do not need to upload the file from your device. See Importing audio from a link for the same flow in other tools.
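The three phases always occur in the same order, which can be pictured as a simple linear progression. This is an illustrative sketch only: the names `ImportPhase` and `next_phase` are hypothetical and do not describe how the app is actually implemented.

```python
from enum import Enum
from typing import Optional


class ImportPhase(Enum):
    """The three phases shown in the app during a link import."""
    DOWNLOADING = "Downloading"
    SAVING = "Saving"
    PROCESSING = "Processing"


# Order in which the phases are shown, as listed in this article.
PHASE_ORDER = [
    ImportPhase.DOWNLOADING,
    ImportPhase.SAVING,
    ImportPhase.PROCESSING,
]


def next_phase(current: ImportPhase) -> Optional[ImportPhase]:
    """Return the phase that follows `current`, or None after Processing."""
    index = PHASE_ORDER.index(current)
    if index + 1 < len(PHASE_ORDER):
        return PHASE_ORDER[index + 1]
    return None
```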

How to run

Open Video to Text from the Tools tab on web, iOS or Android.
Tap Choose to pick a video file, or Paste link to import from a URL.
Processing starts automatically. For device uploads, the app may compress the video first to speed up transfer.
When done, the transcript appears on the Text tab. Open Summary for a structured overview.

ℹ️ After processing starts, you can close or minimize the app. When the transcript is ready, VocaLingo sends a push notification if notifications are enabled. See Push notifications.

What you get

Results are organized in three tabs:

Video: playback of the source video with download and share options. Link imports use remote streaming so large files play without loading the full file into device memory.
Text: full transcript with detected language and speaker labels when multiple speakers are present.
Summary: structured overview on demand (title, summary, key moments, takeaway and quote highlights). Tap Download PDF to export the summary as a formatted PDF. See Video summary.

Results are saved to History automatically.

Text tab options

On the Text tab you can:

Toggle Show timecodes to switch between plain transcript and segmented view with timestamps and speaker labels.
Copy the full transcript or share it from the toolbar.
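To make the difference between the two views concrete, the sketch below renders the same transcript with and without timecodes. It is a hypothetical model, not the app's real data format: the `Segment` fields, the `HH:MM:SS` timecode style and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One transcript segment (hypothetical structure)."""
    start_seconds: float
    speaker: str
    text: str


def format_timecode(seconds: float) -> str:
    """Format a start time as HH:MM:SS, dropping fractional seconds."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments: List[Segment], show_timecodes: bool) -> str:
    """Render plain text, or one line per segment with timecode and speaker."""
    if not show_timecodes:
        return " ".join(s.text for s in segments)
    return "\n".join(
        f"[{format_timecode(s.start_seconds)}] {s.speaker}: {s.text}"
        for s in segments
    )
```

With Show timecodes off, the segments are joined into one plain transcript; with it on, each segment appears on its own line with its start time and speaker label.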

Speech recognition

Video to Text uses ElevenLabs Scribe v2 for speech-to-text. This is not OpenAI Whisper. Scribe v2 supports automatic language detection and speaker diarization (who said what) for clearer transcripts and summaries.

How much it costs

Video to Text charges tokens for speech recognition based on the duration of the video. Generating the Summary tab uses additional tokens. For the full pricing breakdown, see Token pricing for each tool.

Frequently asked questions

The link will not download. What should I try?

Processing takes a long time. Is that normal?

I did not get a push notification when transcription finished.

Video playback does not work on iOS.

What is the difference between device upload and paste link?

Does Video to Text detect the language automatically?

Can I download or share the source video?
