The Voice Assistant application implements a full voice interaction pipeline on Android, enabling real-time, conversational interactions.
Figure: The voice interaction pipeline.
It generates intelligent responses using three components: Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS).
The following sections describe how each component works in the application.
Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), converts spoken language into written text.
This process typically involves capturing audio from the microphone, extracting acoustic features, and decoding them into a text transcription.
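The document does not name the recognition engine; as an illustration, here is a minimal sketch using the platform `SpeechRecognizer` API from the Android SDK. The `onTranscription` callback is a hypothetical hook that hands the result to the next pipeline stage.

```kotlin
import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer

// Sketch: capture speech via the platform recognizer and pass the final
// transcription on to the next pipeline stage (the LLM).
class SttStage(context: Context, private val onTranscription: (String) -> Unit) {

    private val recognizer = SpeechRecognizer.createSpeechRecognizer(context).apply {
        setRecognitionListener(object : RecognitionListener {
            override fun onResults(results: Bundle) {
                // The engine returns an N-best list; take the top hypothesis.
                results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                    ?.firstOrNull()
                    ?.let(onTranscription)
            }
            override fun onError(error: Int) { /* retry or surface to the UI */ }
            // Remaining callbacks elided for brevity.
            override fun onReadyForSpeech(params: Bundle?) {}
            override fun onBeginningOfSpeech() {}
            override fun onRmsChanged(rmsdB: Float) {}
            override fun onBufferReceived(buffer: ByteArray?) {}
            override fun onEndOfSpeech() {}
            override fun onPartialResults(partialResults: Bundle?) {}
            override fun onEvent(eventType: Int, params: Bundle?) {}
        })
    }

    fun startListening() {
        val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
            .putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                      RecognizerIntent.EXTRA_LANGUAGE_MODEL_FREE_FORM)
        recognizer.startListening(intent)
    }
}
```

This requires the `RECORD_AUDIO` permission and must be driven from the main thread; error handling is reduced to a stub here.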
Large Language Models (LLMs) enable natural language understanding and, in this application, are used for question-answering.
The text transcription produced by the previous pipeline stage is used as input to the LLM. At initialization, the app sets a predefined persona that shapes the tone, style, and character of the responses.
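The exact prompt format is app-specific; as a sketch, a predefined persona might simply be prepended to each transcription before it reaches the model. The persona text and the `buildPrompt` helper below are illustrative, not the app's actual code.

```kotlin
// Illustrative only: combine a fixed persona with the user's transcription
// into a single prompt string for the LLM.
const val PERSONA = "You are a friendly, concise voice assistant."

fun buildPrompt(persona: String, transcription: String): String =
    buildString {
        appendLine(persona)          // sets tone, style, and character
        append("User: ")
        appendLine(transcription.trim())
        append("Assistant:")         // the model continues from here
    }

fun main() {
    println(buildPrompt(PERSONA, "What's the weather like?"))
}
```

Because the persona is fixed at initialization, every turn of the conversation is answered in the same voice without the user having to restate it.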
By default, the LLM runs asynchronously, streaming tokens as they are generated. The UI updates in real time with each token, which is also passed to the final pipeline stage.
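The streaming behaviour can be sketched in pure Kotlin. The `onUiUpdate` and `onToken` callbacks below are hypothetical stand-ins for the app's real UI refresh and TTS hand-off.

```kotlin
// Illustrative consumer of a token stream: grow the visible response text
// with every token and forward each token to the speech stage.
class StreamingResponse(
    private val onUiUpdate: (String) -> Unit,  // refresh the on-screen text
    private val onToken: (String) -> Unit      // feed the TTS stage
) {
    private val text = StringBuilder()

    fun accept(token: String) {
        text.append(token)
        onUiUpdate(text.toString())  // UI shows the partial response
        onToken(token)               // TTS can start before generation ends
    }

    fun current(): String = text.toString()
}

fun main() {
    val uiStates = mutableListOf<String>()
    val stream = StreamingResponse(onUiUpdate = { uiStates.add(it) }, onToken = { })
    listOf("Hel", "lo", " there").forEach(stream::accept)
    // uiStates grows as "Hel", "Hello", "Hello there"
}
```

In the app this consumer would run on a background coroutine, with the UI update dispatched to the main thread.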
This part of the application pipeline uses the Android Text-to-Speech API along with additional logic to produce smooth, natural speech.
In synchronous mode, speech playback begins only after the full LLM response is received. By default, the application operates in asynchronous mode, where speech synthesis starts as soon as a full or partial sentence is ready. Remaining tokens are buffered and processed by the Android Text-to-Speech engine to ensure uninterrupted playback.
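The sentence-level buffering described above can be sketched in pure Kotlin. The boundary heuristic and the `onSentence` callback are illustrative; in the app, `onSentence` would hand each chunk to the Android Text-to-Speech engine (e.g. `TextToSpeech.speak` with `QUEUE_ADD`) so playback continues without gaps.

```kotlin
// Illustrative: buffer streamed tokens and emit a chunk whenever a sentence
// boundary appears, so speech synthesis can start before the LLM finishes.
class SentenceChunker(private val onSentence: (String) -> Unit) {
    private val buffer = StringBuilder()

    fun accept(token: String) {
        buffer.append(token)
        // Naive boundary heuristic: last sentence-final punctuation mark.
        val boundary = buffer.lastIndexOfAny(charArrayOf('.', '!', '?'))
        if (boundary >= 0) {
            onSentence(buffer.substring(0, boundary + 1).trim())
            buffer.delete(0, boundary + 1)
        }
    }

    // Flush whatever remains once the LLM signals end of stream.
    fun finish() {
        val rest = buffer.toString().trim()
        if (rest.isNotEmpty()) onSentence(rest)
        buffer.clear()
    }
}

fun main() {
    val spoken = mutableListOf<String>()
    val chunker = SentenceChunker { spoken.add(it) }
    listOf("Hello", " there.", " How", " are you").forEach(chunker::accept)
    chunker.finish()
    // spoken holds "Hello there." then "How are you"
}
```

A production version would need a smarter boundary test (abbreviations, decimal numbers), but the buffering pattern is the same.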