WhisperX: Precision Transcriptions
Open source tool for advanced audio transcription.
Practical Guide to WhisperX: Transcription and Diarization (Speaker-ID)
WhisperX is an advanced tool built on OpenAI's Whisper speech recognition model. It not only transcribes audio with very high accuracy, but also offers two crucial features for professional use:
- Temporal alignment: Assigns a precise time (milliseconds) to every single word.
- Diarization (Pyannote): Recognizes and separates the voices of different speakers.
This guide walks you through installation and usage step by step.
PHASE 1: Setting Up Hugging Face and Pyannote
WhisperX's diarization relies on models created by "Pyannote" and hosted on Hugging Face. These models are free but "gated" (access-controlled).
- Create a free account on Hugging Face.
- Accept the terms of use for pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0.
- Create a token with Read permissions (it will start with hf_...).
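One way to keep the token out of your shell history is to store it in an environment variable and read it from there. The sketch below assumes the variable is named HF_TOKEN (an assumed name, nothing WhisperX requires) and only checks the hf_ prefix mentioned above; it does not contact Hugging Face or verify that the token is valid.

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in an environment variable.

    `var_name` is an assumption for this sketch, not a WhisperX requirement.
    Raises ValueError when the variable is missing or lacks the `hf_` prefix.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise ValueError(f"{var_name} is not set")
    if not token.startswith("hf_"):
        raise ValueError(f"{var_name} does not look like a Hugging Face token")
    return token


if __name__ == "__main__":
    try:
        read_hf_token()
        print("Token found.")
    except ValueError as err:
        print(f"Token problem: {err}")
```

The returned string can then be passed wherever the guide says YOUR_TOKEN_HERE.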
PHASE 2: Environment Installation
WhisperX is designed to run on an NVIDIA graphics card (GPU) with CUDA support, and the commands below assume one is available. Run these commands:
sudo apt-get update && sudo apt-get install -y ffmpeg
pip install git+https://github.com/m-bain/whisperX.git
pip install transformers==4.50.0
pip install --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
PHASE 3: Running the Main Command
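Before running the main command, it can help to confirm that the pieces installed in Phase 2 are visible on your system. The following is a minimal sketch using only the Python standard library. The function name check_prerequisites is hypothetical, and the check only tests whether each item can be found, not whether GPU acceleration actually works.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("whisperx", "torch"), tools=("ffmpeg", "nvidia-smi")):
    """Report which Python modules and command-line tools can be found."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH
        report[f"tool:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for item, found in check_prerequisites().items():
        print(f"{item}: {'found' if found else 'MISSING'}")
```

Anything reported as MISSING points back to the corresponding install step in Phase 2.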
whisperx "audio_file_name.wav" \
--model large-v3 \
--language en \
--diarize \
--min_speakers 2 \
--max_speakers 2 \
--hf_token YOUR_TOKEN_HERE \
--compute_type float16 \
--output_dir /output/folder \
--output_format all
Parameter Explanation:
- --model large-v3: Uses the largest and most accurate Whisper model.
- --language [code]: Forces the audio language (e.g. en, it, id).
- --diarize: Activates speaker recognition.
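- --min_speakers / --max_speakers: Bound the number of speakers the diarizer may detect, which helps the AI avoid creating "ghost" voices.
- --compute_type float16: Runs the model in half precision, which reduces video RAM usage compared with full precision.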
- --output_format all: Writes every supported output format in one run.
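If you launch WhisperX from a script rather than typing the command by hand, the parameters above map naturally onto an argument list. The sketch below only assembles the flags shown in this guide. The function names and default values are illustrative assumptions, and actually running the command requires the installation from Phase 2.

```python
import subprocess


def build_whisperx_command(audio_path, hf_token, language="en", speakers=2,
                           output_dir="/output/folder"):
    """Assemble the argument list for the command shown in this guide."""
    return [
        "whisperx", audio_path,
        "--model", "large-v3",
        "--language", language,
        "--diarize",
        "--min_speakers", str(speakers),
        "--max_speakers", str(speakers),
        "--hf_token", hf_token,
        "--compute_type", "float16",
        "--output_dir", output_dir,
        "--output_format", "all",
    ]


def run_whisperx(audio_path, hf_token, **options):
    """Run the command; raises CalledProcessError if whisperx exits with an error."""
    command = build_whisperx_command(audio_path, hf_token, **options)
    # Passing a list (not a string) avoids shell quoting problems with file names.
    return subprocess.run(command, check=True)
```

Building the list separately from running it makes it easy to print and inspect the exact command before a long transcription job starts.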
Note: The Align Model Issue (Non-standard Languages)
If you use a less common language (e.g. Indonesian), add a language-specific Wav2Vec2 alignment model to the command:
--align_model indonesian-nlp/wav2vec2-large-xlsr-indonesian
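For scripts that handle several languages, the override can be kept in a small lookup table. This is only a sketch: the table ALIGN_MODEL_OVERRIDES and the helper align_model_args are hypothetical names, and the single entry is the Indonesian model named above, keyed by the id language code.

```python
# Hypothetical lookup table: language code -> alignment model on Hugging Face.
ALIGN_MODEL_OVERRIDES = {
    "id": "indonesian-nlp/wav2vec2-large-xlsr-indonesian",
}


def align_model_args(language, overrides=ALIGN_MODEL_OVERRIDES):
    """Return extra CLI arguments for `language`, or an empty list if none are listed."""
    model = overrides.get(language)
    return ["--align_model", model] if model else []


if __name__ == "__main__":
    print(align_model_args("id"))
    print(align_model_args("en"))
```

Languages without an entry get no extra arguments, so WhisperX falls back to whatever alignment model it selects on its own.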
PHASE 4: Output Formats
- .json: Text, millisecond-level word timings, speaker ID.
- .srt / .vtt: Standard subtitle formats.
- .txt: Clean plain text.
- .tsv: Tabular format for Excel or Sheets.
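To post-process the .json output, a few lines of standard-library Python are enough. The sketch below assumes the file has a top-level segments list whose items carry start, end, text and, after diarization, a speaker label. Check these field names against your own output before relying on them. The sample data is invented for illustration.

```python
import json


def speaker_turns(result):
    """Merge consecutive segments from the same speaker into (speaker, start, end, text)."""
    turns = []
    for seg in result.get("segments", []):
        speaker = seg.get("speaker", "UNKNOWN")  # a segment may lack a label
        text = seg.get("text", "").strip()
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous turn: extend its end time and text.
            prev = turns[-1]
            turns[-1] = (speaker, prev[1], seg["end"], f"{prev[3]} {text}")
        else:
            turns.append((speaker, seg["start"], seg["end"], text))
    return turns


# Invented sample that mimics the assumed structure.
SAMPLE = json.loads("""
{"segments": [
  {"start": 0.0, "end": 1.5, "text": " Hello there.", "speaker": "SPEAKER_00"},
  {"start": 1.6, "end": 2.9, "text": " How are you?", "speaker": "SPEAKER_00"},
  {"start": 3.1, "end": 4.0, "text": " Fine, thanks.", "speaker": "SPEAKER_01"}
]}
""")

if __name__ == "__main__":
    for speaker, start, end, text in speaker_turns(SAMPLE):
        print(f"[{start:.2f}-{end:.2f}] {speaker}: {text}")
```

For a real transcription, load the file with json.load and pass the resulting dictionary to speaker_turns in place of SAMPLE.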