Connect speech recognition to vLLM for real-time voice interaction
Now that both faster-whisper and vLLM are working independently, it’s time to connect them into a real-time speech-to-response pipeline. Your system will listen to live audio, transcribe it, and send the transcription to vLLM to generate an intelligent reply - all running locally without cloud services.
For a robust and production-aligned architecture, separate the system into two independent processes:

- The vLLM inference server, running on the GPU inside a Docker container
- The Python STT client, capturing and transcribing audio on the host CPU
This separation has several advantages:
Separating container startup from model launch provides greater control and improves development experience.
By launching the container first, you can troubleshoot errors like model path issues or GPU memory limits directly inside the environment, without the container shutting down immediately. It also speeds up iteration: you avoid reloading the entire image each time you tweak settings or restart the model.
This structure also improves visibility. You can inspect files, monitor GPU usage, or run diagnostics like curl and nvidia-smi inside the container. Breaking these steps apart makes the process easier to understand, debug, and extend.
Start the Docker container:
export LATEST_VLLM_VERSION=25.11-py3
docker run --gpus all \
-p 8000:8000 \
-v $HOME/models:/models \
-e NVIDIA_VISIBLE_DEVICES=all \
-it nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} bash
Inside the container, launch vLLM:
vllm serve /models/mistral-7b \
--quantization gptq \
--gpu-memory-utilization 0.9 \
--max-num-seqs 8 \
--dtype float16
Look for “Application startup complete.” in the output:
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
The vLLM server is now live and ready to accept HTTP requests.
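Before wiring up the microphone, you can confirm the endpoint responds with a quick smoke test from the host. This is a sketch that assumes the endpoint and model path from the steps above; `build_chat_payload` is a helper name introduced here for illustration, and the request is wrapped so the script degrades gracefully if the server is not yet up:

```python
VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "/models/mistral-7b"

def build_chat_payload(model, text, max_tokens=64):
    """Assemble a minimal OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": text}],
        "max_tokens": max_tokens,
    }

payload = build_chat_payload(MODEL_NAME, "Say hello in one sentence.")

try:
    import requests
    response = requests.post(VLLM_ENDPOINT, json=payload, timeout=120)
    print(response.json()["choices"][0]["message"]["content"])
except Exception as err:
    print(f"Request failed (is the vLLM server running?): {err}")
```

If the server prints "Application startup complete." but this request fails, check that port 8000 was published when you started the container.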
Now that you’ve implemented a real-time speech recognizer, extend the pipeline by connecting it to a local large language model (LLM) served by vLLM.
You’ll convert the STT result into a message prompt, send it to the running vLLM server via HTTP, dynamically estimate max_tokens based on input length, and print the model’s reply next to the transcribed speech.
Set up the LLM endpoint and model reference by adding the following variables at the top of your script:
VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "/models/mistral-7b"
Make sure these match the vLLM server you launched in the previous step.
After transcribing the user’s speech, send the result to the vLLM server by formatting it as a chat prompt.
user_text = " ".join([seg.text.strip() for seg in segments]).strip()
Estimate token length and send the request:
max_tokens = min(256, max(64, len(user_text.split()) * 5))
response = requests.post(VLLM_ENDPOINT, json={
"model": MODEL_NAME,
"messages": [{"role": "user", "content": user_text}],
"max_tokens": max_tokens
})
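The heuristic above allows roughly five generated tokens per input word, clamped between 64 and 256. Factoring it into a small helper makes the bounds easy to tune; this is a sketch, and `estimate_max_tokens` is a name introduced here:

```python
def estimate_max_tokens(user_text, floor=64, ceiling=256, tokens_per_word=5):
    """Clamp a rough per-word token estimate to a sensible range."""
    return min(ceiling, max(floor, len(user_text.split()) * tokens_per_word))

print(estimate_max_tokens("tell me a joke"))  # short input hits the floor: 64
```

Raising `ceiling` allows longer replies at the cost of latency; raising `floor` prevents very short inputs from truncating the model mid-sentence.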
Extract the assistant’s reply from the vLLM API response:
result = response.json()
reply = result["choices"][0]["message"]["content"].strip()
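Because vLLM follows the OpenAI chat-completions schema, a response without a "choices" key is usually an error payload rather than a completion. A small defensive helper keeps that check out of your main loop; this is a sketch, and `extract_reply` is a name introduced here:

```python
def extract_reply(result):
    """Return the assistant's text from a chat-completions response,
    or None if the payload is an error instead of a completion."""
    choices = result.get("choices")
    if not choices:
        return None
    return choices[0]["message"]["content"].strip()

sample = {"choices": [{"message": {"role": "assistant", "content": "  Hi there!  "}}]}
print(extract_reply(sample))  # -> Hi there!
```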
Display both the transcribed input and the model’s response:
...
print(f"\n User: {user_text}\n")
...
print(f" AI : {reply}\n")
This architecture mirrors the OpenAI Chat API design, enabling future enhancements like system-level prompts, multi-turn history, or role-specific behavior.
If you encounter a “model doesn’t exist” error, double-check the model path you used when launching vLLM. It must match MODEL_NAME exactly.
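One quick way to verify the name is the server's /v1/models endpoint, which lists every model ID vLLM is serving. This sketch assumes the endpoint from earlier; `served_model_ids` is a helper name introduced here, shown against a sample response body:

```python
def served_model_ids(models_response):
    """Extract model IDs from a /v1/models response body."""
    return [m["id"] for m in models_response.get("data", [])]

# In practice, fetch the body with:
#   requests.get("http://localhost:8000/v1/models").json()
sample = {"object": "list", "data": [{"id": "/models/mistral-7b", "object": "model"}]}
print(served_model_ids(sample))  # -> ['/models/mistral-7b']
```

Whatever ID appears in that list is the exact string your client must send as `MODEL_NAME`.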
Switch to another terminal and save the following Python code in a file named stt-client.py:
import pyaudio
import numpy as np
import webrtcvad
import time
import threading
import queue
import requests
from faster_whisper import WhisperModel
from collections import deque

# --- Parameters ---
SAMPLE_RATE = 16000
FRAME_DURATION_MS = 30
FRAME_SIZE = int(SAMPLE_RATE * FRAME_DURATION_MS / 1000)
VAD_MODE = 3
SILENCE_LIMIT_SEC = 1.0
MIN_SPEECH_SEC = 2.0
VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "/models/mistral-7b"

# --- Init VAD and buffers ---
vad = webrtcvad.Vad(VAD_MODE)
speech_buffer = deque()
speech_started = False
last_speech_time = time.time()

# --- Init Thread and Queue ---
audio_queue = queue.Queue()
stop_event = threading.Event()

# --- Init Whisper model ---
device = "cpu"         # "cpu" or "cuda"
compute_type = "int8"  # "int8" or "float16"
model = WhisperModel("medium.en", device=device, compute_type=compute_type)

# --- Audio capture thread ---
def audio_capture():
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16,
                     channels=1,
                     rate=SAMPLE_RATE,
                     input=True,
                     frames_per_buffer=FRAME_SIZE)
    print(" Listening... Press Ctrl+C to stop")
    try:
        while not stop_event.is_set():
            frame = stream.read(FRAME_SIZE, exception_on_overflow=False)
            audio_queue.put(frame)
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()

# --- Start audio capture thread ---
threading.Thread(target=audio_capture, daemon=True).start()

# --- Main loop: process queue and transcribe ---
try:
    while True:
        if audio_queue.empty():
            time.sleep(0.01)
            continue

        frame = audio_queue.get()
        is_speech = vad.is_speech(frame, SAMPLE_RATE)

        if is_speech:
            speech_buffer.append(frame)
            speech_started = True
            last_speech_time = time.time()
        elif speech_started:
            speech_duration = len(speech_buffer) * (FRAME_DURATION_MS / 1000.0)
            silence_duration = time.time() - last_speech_time

            if silence_duration > SILENCE_LIMIT_SEC:
                if speech_duration >= MIN_SPEECH_SEC:
                    print(" Transcribing buffered speech...")
                    audio_bytes = b"".join(speech_buffer)
                    audio_np = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
                    segments, _ = model.transcribe(audio_np, language="en")
                    user_text = " ".join([seg.text.strip() for seg in segments]).strip()
                    print(f"\n User: {user_text}\n")

                    # Estimate max_tokens from input length, clamped to 64-256
                    max_tokens = min(256, max(64, len(user_text.split()) * 5))
                    response = requests.post(VLLM_ENDPOINT, json={
                        "model": MODEL_NAME,
                        "messages": [
                            {"role": "user", "content": user_text}
                        ],
                        "max_tokens": max_tokens
                    })
                    result = response.json()
                    if "choices" not in result:
                        print(" Error from vLLM:", result.get("error", "Unknown error"))
                    else:
                        reply = result["choices"][0]["message"]["content"].strip()
                        print(f" AI : {reply}\n")
                else:
                    print(f" Skipped short segment ({speech_duration:.2f}s < {MIN_SPEECH_SEC}s)")
                # Reset the buffer whether the segment was used, skipped, or errored
                speech_buffer.clear()
                speech_started = False
except KeyboardInterrupt:
    print(" Stopped")
finally:
    stop_event.set()
Run the code using:
python3 ./stt-client.py
Once both your vLLM server and Python STT script are running correctly, you’ll see output like the following in your terminal.
Each time you speak a full sentence (based on your silence and segment thresholds), the system transcribes your speech, displays the recognized text, and shows the model’s reply in natural language.
If your input is too short (a false trigger or background noise spike), you’ll see a message like:
Skipped short segment (1.32s < 2.0s)
This means your speech did not meet the MIN_SPEECH_SEC threshold. You can adjust this value to make the system more or less sensitive.
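Because speech length is measured in fixed 30 ms VAD frames, each threshold maps directly to a frame count. This sketch mirrors the duration math in the script, which can help when picking a new MIN_SPEECH_SEC:

```python
FRAME_DURATION_MS = 30
MIN_SPEECH_SEC = 2.0

def buffered_duration(frame_count, frame_ms=FRAME_DURATION_MS):
    """Duration in seconds of a buffer of fixed-size VAD frames."""
    return frame_count * (frame_ms / 1000.0)

# About 67 frames of 30 ms are needed to clear the 2.0 s threshold.
print(buffered_duration(67) >= MIN_SPEECH_SEC)  # -> True
```

Lowering MIN_SPEECH_SEC makes the assistant respond to shorter utterances but also to more false triggers from background noise.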
An example when asking the assistant for a joke:
Listening... Press Ctrl+C to stop
Skipped short segment (0.39s < 2.0s)
Skipped short segment (1.44s < 2.0s)
Skipped short segment (1.89s < 2.0s)
Skipped short segment (1.77s < 2.0s)
Skipped short segment (0.36s < 2.0s)
Transcribing buffered speech...
Listening... Press Ctrl+C to stop
Transcribing buffered speech...
User: Hello, please tell me the joke.
AI : Of course, I'd be happy to tell you a joke! Here's a classic one:
Why don't libraries smell like popcorn?
Because they are full of books, not movies!
I hope that brings a smile to your face. If you have any other requests, feel free to ask!
You’ve successfully built a complete voice-to-AI-response loop: microphone input is captured in real time, transcribed locally using faster-whisper on CPU, forwarded to a local vLLM server running on GPU, and receives intelligent responses with low latency.
This foundation supports a wide range of customizations, such as customer-specific workflows built with prompt engineering and multi-turn memory.