Practical Guide to WhisperX: Transcription and Diarization (Speaker-ID)

WhisperX is a transcription tool built on OpenAI's Whisper speech-recognition model. Beyond accurate transcription, it adds two features that matter for professional use:

Temporal alignment: Assigns start and end times, with millisecond-level precision, to every word rather than only to whole segments.
Diarization (Pyannote): Detects who is speaking when and labels the transcript with a speaker ID.

This guide walks you through installation and usage step by step.

PHASE 1: Setting Up Hugging Face and Pyannote

WhisperX's diarization relies on models from the pyannote project, hosted on Hugging Face. These models are free but "gated": you must accept their terms of use and authenticate with an access token before they can be downloaded.

Create a free account on Hugging Face.
Accept the terms of use for pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0.
Create a token with Read permissions (it will start with hf_...).
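
One way to keep the token out of your scripts is to store it in an environment variable and sanity-check it before use. The sketch below assumes a variable named HF_TOKEN and a helper named check_hf_token; both names are illustrative, and the helper checks only the hf_ prefix mentioned above without contacting Hugging Face.

```shell
# Hypothetical helper: succeeds only if the value looks like a Hugging Face token.
# It checks the "hf_" prefix and nothing else; it does not validate the token online.
check_hf_token() {
  case "$1" in
    hf_?*) return 0 ;;
    *)     echo "Token is missing or does not start with hf_" >&2; return 1 ;;
  esac
}

# Usage (HF_TOKEN is an assumed variable name; set it in your shell first):
#   export HF_TOKEN="<paste your token here>"
#   check_hf_token "$HF_TOKEN" && echo "Token format looks plausible"
```

The variable can then be passed to the command in Phase 3 as --hf_token "$HF_TOKEN".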

PHASE 2: Environment Installation

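WhisperX runs best on an NVIDIA graphics card (GPU) with CUDA, and the commands in this guide assume one. It can also run on CPU with --device cpu and --compute_type int8, but much more slowly. On Debian or Ubuntu, run these commands: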

sudo apt-get update && sudo apt-get install -y ffmpeg
pip install git+https://github.com/m-bain/whisperX.git
pip install transformers==4.50.0
pip install --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
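
Before transcribing anything, it can help to confirm that the tools installed above are on your PATH and that PyTorch can see the GPU. The following is a minimal sketch; have_cmd and report_tools are illustrative helper names, not part of WhisperX.

```shell
# Succeeds if the named command can be found on PATH.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

# Print one "found:" or "missing:" line per tool name given.
report_tools() {
  for tool in "$@"; do
    if have_cmd "$tool"; then echo "found: $tool"; else echo "missing: $tool"; fi
  done
}

report_tools ffmpeg whisperx nvidia-smi

# Ask PyTorch whether it can see a CUDA device (prints True or False).
if have_cmd python3; then
  python3 -c "import torch; print(torch.cuda.is_available())" \
    || echo "PyTorch could not be imported"
fi
```

If the last check prints False on a machine that has an NVIDIA GPU, the installed PyTorch build probably does not match the local CUDA driver.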

PHASE 3: Running the Main Command

whisperx "audio_file_name.wav" \
  --model large-v3 \
  --language en \
  --diarize \
  --min_speakers 2 \
  --max_speakers 2 \
  --hf_token YOUR_TOKEN_HERE \
  --compute_type float16 \
  --output_dir /output/folder \
  --output_format all

Parameter Explanation:

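--model large-v3: Uses the largest Whisper model, which is generally the most accurate but also the slowest and most memory-hungry.
--language [code]: Sets the audio language explicitly (e.g. en, it, id) instead of relying on automatic detection.
--diarize: Activates speaker recognition.
--min_speakers / --max_speakers: Bound the number of speakers the diarizer may report, which helps avoid spurious "ghost" voices. Setting both to the same value, as above, fixes the speaker count.
--hf_token: The Hugging Face token from Phase 1, needed to download the gated Pyannote models.
--compute_type float16: Runs the model in half precision, which reduces video RAM usage compared with float32.
--output_dir: The folder where the result files are written.
--output_format all: Writes every supported output format instead of a single one.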
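
To apply the same settings to several recordings, the main command can be wrapped in a small function. This is a sketch under stated assumptions: transcribe_folder, DRY_RUN, and HF_TOKEN are illustrative names that are not part of WhisperX, and the flags are the ones used in the command above. By default it only prints each command, without the token, so you can inspect it first.

```shell
# Hypothetical wrapper: run the guide's command on every .wav file in a folder.
# Arguments: input folder, output folder, expected number of speakers.
# DRY_RUN=1 (the default) prints the commands; DRY_RUN=0 executes them.
transcribe_folder() {
  in_dir="$1"; out_dir="$2"; speakers="$3"
  for f in "$in_dir"/*.wav; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    # Reuse the positional parameters to hold the command and its arguments.
    set -- whisperx "$f" \
      --model large-v3 --language en --diarize \
      --min_speakers "$speakers" --max_speakers "$speakers" \
      --compute_type float16 \
      --output_dir "$out_dir" --output_format all
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "$*"               # token deliberately left out of the printout
    else
      "$@" --hf_token "${HF_TOKEN:?set HF_TOKEN first}"
    fi
  done
}

# Usage (paths are placeholders):
#   transcribe_folder ./recordings ./transcripts 2
#   DRY_RUN=0 transcribe_folder ./recordings ./transcripts 2
```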

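Note: The Align Model Issue (Less Common Languages)

WhisperX picks its word-alignment model based on the language code. If no default alignment model is available for your language (e.g. Indonesian), add a specific Wav2Vec2 model from Hugging Face to the command: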
--align_model indonesian-nlp/wav2vec2-large-xlsr-indonesian
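
If you switch between languages often, a small lookup can add the flag only when it is needed. The helper name align_args_for is an illustrative assumption, and the only mapping it contains is the Indonesian model named above; extend it with models you have verified yourself.

```shell
# Hypothetical lookup: print the extra alignment arguments for a language code.
# Prints nothing for languages where WhisperX's default alignment model is used.
align_args_for() {
  case "$1" in
    id) echo "--align_model indonesian-nlp/wav2vec2-large-xlsr-indonesian" ;;
    *)  echo "" ;;
  esac
}

# Usage: leave the substitution unquoted so it splits into two arguments.
#   whisperx "audio_file_name.wav" --language id $(align_args_for id)
```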

PHASE 4: Output Formats

.json: Text, millisecond-level word timings, speaker ID.
.srt / .vtt: Standard subtitle formats.
.txt: Clean plain text.
.tsv: Tabular format for Excel or Sheets.
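
The .tsv file is also easy to process from the command line. The sketch below totals the transcribed speech time; it assumes the file has a header row followed by tab-separated start, end, and text columns, with times in whole milliseconds, so check one of your own files before relying on it. The function name total_speech_seconds is illustrative.

```shell
# Sum the duration of all segments in a TSV file and print it in seconds.
# Assumes: line 1 is a header; columns 1 and 2 are start and end in milliseconds.
total_speech_seconds() {
  awk -F '\t' 'NR > 1 && NF >= 2 { total += $2 - $1 }
               END { printf "%.1f\n", total / 1000 }' "$1"
}

# Usage (the file name is a placeholder):
#   total_speech_seconds /output/folder/audio_file_name.tsv
```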
