Practical Guide to WhisperX: Transcription and Diarization (Speaker-ID)

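WhisperX is a tool built on OpenAI's Whisper speech-recognition model. Beyond accurate transcription, it adds two features that matter for professional use: word-level timestamps (obtained by aligning the transcript to the audio) and speaker diarization (labeling who spoke when).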

This guide walks you through installation and usage step by step.

PHASE 1: Setting Up Hugging Face and Pyannote

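WhisperX's diarization relies on the Pyannote models hosted on Hugging Face. These models are free but "gated": you must accept their terms of use before you can download them, and WhisperX needs a personal access token to fetch them on your behalf.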

  1. Create a free account on Hugging Face.
  2. Accept the terms of use for pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0.
  3. Create a token with Read permissions (it will start with hf_...).
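
Once you have the token, it is usually safer to keep it out of scripts and shell history. The sketch below shows one way to do that in Python, assuming you have stored the token in an environment variable. The variable name HF_TOKEN and the function read_hf_token are conventions chosen for this example, not something WhisperX requires.

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the Hugging Face token stored in an environment variable.

    `var_name` is an assumed convention for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise ValueError(f"{var_name} is not set")
    # Tokens created on Hugging Face start with "hf_" (see step 3).
    if not token.startswith("hf_"):
        raise ValueError(f"{var_name} does not look like a Hugging Face token")
    return token
```

The returned string can then be passed wherever the guide writes YOUR_TOKEN_HERE.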

PHASE 2: Environment Installation

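WhisperX runs best on an NVIDIA graphics card (GPU) with CUDA. It can also run on CPU with --device cpu --compute_type int8, but much more slowly. The commands below assume a Debian/Ubuntu system and a CUDA 12.1 build of PyTorch: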

sudo apt-get update && sudo apt-get install -y ffmpeg
pip install git+https://github.com/m-bain/whisperX.git
pip install transformers==4.50.0
pip install --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
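
After installation it is worth checking that PyTorch can actually see the GPU. The following is a minimal sketch: cuda_available and pick_runtime are hypothetical helper names, and the CPU fallback values (--device cpu, --compute_type int8) are the ones the WhisperX README suggests for machines without a GPU.

```python
def cuda_available():
    """Return True if PyTorch is installed and can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def pick_runtime(has_cuda):
    """Map GPU availability to (--device, --compute_type) values."""
    return ("cuda", "float16") if has_cuda else ("cpu", "int8")


if __name__ == "__main__":
    device, compute_type = pick_runtime(cuda_available())
    print(f"--device {device} --compute_type {compute_type}")
```

If the script reports cpu on a machine that has an NVIDIA card, the PyTorch build probably does not match the installed CUDA driver.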

PHASE 3: Running the Main Command

whisperx "audio_file_name.wav" \
  --model large-v3 \
  --language en \
  --diarize \
  --min_speakers 2 \
  --max_speakers 2 \
  --hf_token YOUR_TOKEN_HERE \
  --compute_type float16 \
  --output_dir /output/folder \
  --output_format all
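
If you would rather launch this command from a script (for example, to process many files), the same arguments can be assembled in Python. This is only a sketch: build_whisperx_command is a hypothetical helper, and it uses nothing beyond the flags shown in the command above.

```python
import shlex


def build_whisperx_command(audio_path, hf_token, output_dir,
                           language="en", num_speakers=2,
                           model="large-v3", compute_type="float16"):
    """Assemble the argument list for the whisperx command shown above."""
    return [
        "whisperx", str(audio_path),
        "--model", model,
        "--language", language,
        "--diarize",
        "--min_speakers", str(num_speakers),
        "--max_speakers", str(num_speakers),
        "--hf_token", hf_token,
        "--compute_type", compute_type,
        "--output_dir", str(output_dir),
        "--output_format", "all",
    ]


cmd = build_whisperx_command("audio_file_name.wav", "YOUR_TOKEN_HERE", "/output/folder")
print(shlex.join(cmd))  # shows the equivalent shell command
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.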

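Parameter Explanation:

  --model large-v3: the Whisper model to use. large-v3 is a large, high-accuracy model and needs more GPU memory than the smaller ones.
  --language en: the language spoken in the audio. Setting it skips automatic language detection.
  --diarize: enables speaker diarization, so each segment is labeled with a speaker.
  --min_speakers / --max_speakers: bounds on the number of speakers. Set both to the same value when you know the exact count.
  --hf_token: the Hugging Face token from Phase 1, needed to download the gated Pyannote models.
  --compute_type float16: the numeric precision used for inference. Use int8 if you run out of GPU memory or run on CPU.
  --output_dir: the folder where the result files are written.
  --output_format all: writes every supported output format.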

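Note: Alignment Models for Less Common Languages

WhisperX needs a language-specific alignment model to produce word-level timestamps. If it has no default model for your language (this can happen with less common languages, e.g. Indonesian), add a specific Wav2Vec2 model from Hugging Face to the command: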
--align_model indonesian-nlp/wav2vec2-large-xlsr-indonesian
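
When several languages are processed by the same script, the extra flag can be added only where it is needed. The sketch below is illustrative: CUSTOM_ALIGN_MODELS and align_model_args are hypothetical names, the table contains only the Indonesian entry from this guide, and "id" is the ISO 639-1 code for Indonesian.

```python
# Hypothetical lookup table; only the Indonesian entry comes from this guide.
CUSTOM_ALIGN_MODELS = {
    "id": "indonesian-nlp/wav2vec2-large-xlsr-indonesian",
}


def align_model_args(language_code):
    """Return extra CLI arguments for a language, or [] to keep the default."""
    model = CUSTOM_ALIGN_MODELS.get(language_code)
    return ["--align_model", model] if model else []
```

The returned list can be appended to the rest of the whisperx arguments.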

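PHASE 4: Output Formats

With --output_format all, WhisperX writes one file per format into the output folder, each named after the audio file. These include .srt and .vtt subtitles, a plain-text .txt transcript, a tab-separated .tsv file, and a .json file with the full segment and word-level data.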
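
For post-processing, the JSON output is usually the most convenient. The sketch below turns it into one line per segment with a speaker label. It assumes the file has a top-level "segments" list whose items carry "start", "text", and (when diarization is enabled) "speaker" keys. Check this against your own output, since the layout can vary between WhisperX versions. The function names are hypothetical.

```python
import json


def format_transcript(result):
    """Turn a WhisperX-style result dict into '[time] speaker: text' lines."""
    lines = []
    for seg in result.get("segments", []):
        # Segments the diarizer could not attribute may lack a speaker key.
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"[{seg['start']:.2f}s] {speaker}: {seg['text'].strip()}")
    return lines


def load_transcript(json_path):
    """Read a JSON output file and return its formatted lines."""
    with open(json_path, encoding="utf-8") as f:
        return format_transcript(json.load(f))
```

For example, load_transcript("/output/folder/audio_file_name.json") would return the lines for the file produced by the Phase 3 command, assuming that naming.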

