Before running the app, download a GGUF model file compatible with mobile device memory constraints. To run on a typical Android phone with 8 GB RAM, the model size should be significantly smaller than 8 GB to leave room for the operating system and other apps.
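As a rough back-of-envelope for this sizing, a model's weight file is approximately parameters × bits-per-weight ÷ 8 bytes. The sketch below uses 4.5 bits per weight for Q4_0 (an approximation that accounts for per-block scale factors; real file sizes also include embeddings and metadata):

```kotlin
// Back-of-envelope GGUF size estimate: parameters × bits-per-weight / 8 bytes.
// Q4_0 stores roughly 4.5 bits per weight once per-block scales are counted.
fun approxModelGiB(params: Double, bitsPerWeight: Double): Double =
    params * bitsPerWeight / 8.0 / (1 shl 30)

fun main() {
    // A 4-billion-parameter model at Q4_0: about 2 GiB of weights,
    // leaving comfortable headroom on an 8 GB phone.
    println("%.1f GiB".format(approxModelGiB(4e9, 4.5)))
}
```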
A good example is google_gemma-3-4b-it-Q4_0.gguf. Gemma 3 is a capable model, and this 4-billion-parameter version has been quantized with the Q4_0 scheme, which works particularly well with Arm’s KleidiAI library. This quantization enables speed-ups on phones with SME2, SVE2, and Neon capabilities.
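If you are curious which of these features your phone’s CPU exposes, you can read the feature flags the Linux kernel reports in /proc/cpuinfo. A minimal Kotlin sketch, assuming the usual Arm64 flag names (Neon appears as asimd; exact flag names can vary with kernel version):

```kotlin
import java.io.File

// Print whether the kernel reports Neon (asimd), SVE2, and SME2 support.
fun main() {
    val features = File("/proc/cpuinfo").readLines()
        .firstOrNull { it.startsWith("Features") }
        ?.substringAfter(":")
        ?.trim()
        ?.split(Regex("\\s+"))
        ?: emptyList()

    for (flag in listOf("asimd", "sve2", "sme2")) {
        println("$flag: ${if (flag in features) "yes" else "no"}")
    }
}
```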
Download Gemma 3 or another suitable GGUF model to your phone’s Downloads folder.
Connect your test Android phone to your computer with a USB cable, making sure the phone has Developer Mode enabled and that you allow USB debugging when prompted. You should then be able to run your LLM chatbot app from Android Studio.
In the bottom right there is an “Import model” button. Tapping it opens your Downloads folder so you can select the model you downloaded, and the app then copies it into its own storage. Once it has finished copying and loading the model, “Model ready” appears at the top of the screen. Now tap the text entry area at the bottom to type your questions and chat with the LLM.
AI Chat app running Gemma 3 4B
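Under the hood, an import flow like this can be built with Android’s Storage Access Framework. The sketch below is illustrative only, not the AI Chat app’s actual code; the function names (pickGgufModel, copyModelIntoAppStorage) and the request code are hypothetical:

```kotlin
import android.app.Activity
import android.content.Intent
import android.net.Uri
import java.io.File

const val PICK_GGUF_REQUEST = 42 // hypothetical request code

// Launch the system file picker so the user can select the GGUF they
// downloaded. The Storage Access Framework needs no storage permission.
fun Activity.pickGgufModel() {
    val intent = Intent(Intent.ACTION_OPEN_DOCUMENT).apply {
        addCategory(Intent.CATEGORY_OPENABLE)
        type = "application/octet-stream" // GGUF has no registered MIME type
    }
    // Deprecated in favor of the Activity Result API, but kept short here.
    startActivityForResult(intent, PICK_GGUF_REQUEST)
}

// Copy the selected model into app-private storage so native inference
// code can open it by a plain file path.
fun Activity.copyModelIntoAppStorage(uri: Uri): File {
    val dest = File(filesDir, "model.gguf")
    contentResolver.openInputStream(uri)!!.use { input ->
        dest.outputStream().use { output -> input.copyTo(output) }
    }
    return dest
}
```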
You now have a working on-device LLM chatbot on your Android phone. The AI Chat library handles model loading, tokenization, and inference, and optimizations for Arm CPUs are applied automatically based on your device’s capabilities.
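As a mental model of those stages, here is a hypothetical interface; this is not the AI Chat library’s API, and every name below (LlmEngine, loadModel, tokenize, generate) is illustrative only:

```kotlin
// Hypothetical interface sketching the pipeline stages described above.
interface LlmEngine {
    // Load GGUF weights (typically memory-mapped) and allocate the KV cache.
    fun loadModel(path: String)

    // Convert prompt text into the model's token ids.
    fun tokenize(text: String): IntArray

    // Autoregressive decoding: emit generated text token by token.
    fun generate(promptTokens: IntArray, onToken: (String) -> Unit)
}
```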
If the model fails to load:
- Check that the download completed and the file is a valid GGUF; a truncated or corrupted file will not load.
- Make sure the model is small enough for your phone’s RAM, as discussed above; try a smaller model or a more aggressive quantization. A quick pre-flight memory check is sketched below.
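For the memory case, one approach is to compare available RAM against the model file size before loading, using Android’s ActivityManager. The headroom factor below is a guess for illustration, not a rule from the library:

```kotlin
import android.app.ActivityManager
import android.content.Context

// Rough pre-flight check: is there enough free RAM for the model plus
// its working memory (KV cache, activations), not just the weights?
fun hasRoomForModel(context: Context, modelBytes: Long): Boolean {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo()
    am.getMemoryInfo(info)
    // 1.5x headroom is an illustrative margin, not a library requirement.
    return info.availMem > modelBytes * 3 / 2
}
```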
If inference is slow:
- Prefer a Q4_0-quantized model, since that is the format KleidiAI accelerates on CPUs with SME2, SVE2, or Neon capabilities.
- Try a smaller model, and close background apps so the model and its working memory stay resident in RAM.