Introduction
Build an offline voice assistant with Whisper and vLLM
Install faster-whisper for local speech recognition
Build a real-time STT pipeline on CPU
Fine-tune segmentation parameters
Build a real-time offline voice chatbot using STT and vLLM
Connect speech recognition to vLLM for real-time voice interaction
Specialize offline voice assistants for customer service
Enable context-aware dialogue with short-term memory
Next Steps
After applying the previous steps (model upgrade, VAD, smart turn detection, and multi-threaded audio collection), you now have a high-quality, CPU-based local speech-to-text system.
At this stage, the core pipeline is complete. What remains is fine-tuning: adapting the system to your environment, microphone setup, and speaking style. This flexibility is one of the key advantages of a fully local STT pipeline.
By adjusting a small set of parameters, you can significantly improve transcription stability and user experience.
No two environments are the same. A quiet office, a noisy lab, and a home setup with background music all require different segmentation behavior. Similarly, users speak at different speeds and with different pause patterns.
Fine-tuning allows you to:

- adapt segmentation behavior to the noise level of your environment
- match the system to your microphone setup and speaking style
- balance responsiveness against transcription completeness
The following parameters control how speech is segmented and when transcription is triggered:
Model choice directly impacts accuracy and performance:
Choosing the right model ensures an optimal balance between speed and transcription quality.
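For reference, here is a sketch of commonly used faster-whisper model identifiers and their rough trade-offs. The exact set of models available depends on the faster-whisper version you install, and the characterizations below are approximate rules of thumb, not benchmarked figures:

```python
# Rough guide to common faster-whisper model identifiers.
# The trade-off notes are approximate, not measured benchmarks.
MODEL_TRADEOFFS = {
    "tiny.en":   "fastest, lowest accuracy",
    "small.en":  "fast, good for commands and short phrases",
    "medium.en": "balanced speed and accuracy on CPU",
    "large-v2":  "highest accuracy, slowest on CPU",
}

for name, tradeoff in MODEL_TRADEOFFS.items():
    print(f"{name}: {tradeoff}")
```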
webrtcvad provides multiple aggressiveness levels (0-3):
Adjust this setting based on background noise and microphone quality.
MIN_SPEECH_SEC and SILENCE_LIMIT_SEC

- MIN_SPEECH_SEC: the minimum duration of detected speech needed before a segment is considered valid. Use this to filter out very short utterances such as false starts or background chatter.
- SILENCE_LIMIT_SEC: how long the system waits after speech stops before finalizing a segment.
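To see how the two thresholds interact, here is a simplified, pure-Python sketch of the segmentation logic (a hypothetical helper for illustration, not the actual transcribe.py code). It consumes per-frame VAD decisions and reports when a segment would be finalized:

```python
FRAME_SEC = 0.03          # 30 ms frames
MIN_SPEECH_SEC = 1.0      # minimum speech before a segment counts
SILENCE_LIMIT_SEC = 1.0   # trailing silence that finalizes a segment

def segment(vad_decisions):
    """Yield the length (in seconds) of each finalized speech segment.

    vad_decisions is an iterable of booleans, one per audio frame,
    True when the VAD classifies the frame as speech.
    """
    speech_sec = 0.0
    silence_sec = 0.0
    for is_speech in vad_decisions:
        if is_speech:
            speech_sec += FRAME_SEC
            silence_sec = 0.0
        elif speech_sec > 0:
            silence_sec += FRAME_SEC
            if silence_sec >= SILENCE_LIMIT_SEC:
                # Finalize only if enough speech accumulated;
                # shorter bursts are dropped as noise or false starts.
                if speech_sec >= MIN_SPEECH_SEC:
                    yield round(speech_sec, 2)
                speech_sec = 0.0
                silence_sec = 0.0

# ~2 s of speech followed by 1.5 s of silence -> one finalized segment
frames = [True] * 67 + [False] * 50
print(list(segment(frames)))
```

Raising SILENCE_LIMIT_SEC makes the system wait longer before cutting a sentence; raising MIN_SPEECH_SEC discards more short noise bursts.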
Based on practical experiments, the following presets provide a good starting point:
| Usage Scenario | MIN_SPEECH_SEC | SILENCE_LIMIT_SEC | Description |
|---|---|---|---|
| Short command phrases | 0.8 | 0.6 | Optimized for quick voice commands such as “yes”, “next”, or “stop”. Prioritizes responsiveness over sentence completeness. |
| Natural conversational speech | 1.0 | 1.0 | Balanced settings for everyday dialogue with natural pauses between phrases. |
| Long-form explanations such as tutorials | 2.0 | 2.0 | Designed for longer sentences and structured explanations, reducing the risk of premature segmentation. |
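The presets above can be kept in one place, for example as a small dictionary you copy values from (the scenario names here are only illustrative):

```python
# Illustrative preset table mirroring the values above.
PRESETS = {
    "short_commands": {"MIN_SPEECH_SEC": 0.8, "SILENCE_LIMIT_SEC": 0.6},
    "conversation":   {"MIN_SPEECH_SEC": 1.0, "SILENCE_LIMIT_SEC": 1.0},
    "long_form":      {"MIN_SPEECH_SEC": 2.0, "SILENCE_LIMIT_SEC": 2.0},
}

choice = PRESETS["conversation"]
print(choice["MIN_SPEECH_SEC"], choice["SILENCE_LIMIT_SEC"])  # 1.0 1.0
```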
To modify these parameters in your transcribe.py file, adjust the values at the top of the script:
```python
# --- Parameters ---
SAMPLE_RATE = 16000
FRAME_DURATION_MS = 30
FRAME_SIZE = int(SAMPLE_RATE * FRAME_DURATION_MS / 1000)
VAD_MODE = 3             # Adjust: 0-3 (higher = more aggressive)
SILENCE_LIMIT_SEC = 1.0  # Adjust based on use case
MIN_SPEECH_SEC = 2.0     # Adjust based on use case
```
For conversational use, start with SILENCE_LIMIT_SEC = 1.0 and MIN_SPEECH_SEC = 1.0. If you experience premature sentence breaks, increase both values. If the system feels sluggish, decrease them.
You can also experiment with different faster-whisper models by changing:
```python
model = WhisperModel("medium.en", device=device, compute_type=compute_type)
```
Replace "medium.en" with "small.en" for faster performance or "large-v2" for higher accuracy.
You now understand how to fine-tune your STT system for different environments and use cases. These adjustments allow you to optimize the balance between responsiveness and transcription quality based on your specific needs.
In the next section, you’ll integrate this STT system with vLLM to add natural language understanding and response generation, completing your offline voice assistant.