Llama is a family of large language models trained on publicly available data. Llama models have been shown to perform well on a variety of natural language processing tasks.
Llama models are also capable of generating human-like text, making them a useful tool for creative writing and other applications where natural language generation is key.
Llama models are powerful and versatile; their ability to generate coherent and contextually relevant text makes them particularly useful for a wide range of applications.
Please note that the models are subject to the acceptable use policy and the responsible use guide.
Because Llama 2 and Llama 3 models require at least 4-bit quantization to fit within the memory constraints of some smartphones, the results presented in this Learning Path correspond to 4-bit groupwise post-training quantized models.
One way to create models that fit in smartphone memory is to apply 4-bit groupwise, per-token dynamic quantization to all of the model's linear layers. Dynamic quantization means that activations are quantized dynamically: their quantization parameters are calculated from the observed min/max range at runtime. The weights, in contrast, are statically quantized; in this case, they are per-channel groupwise quantized with 4-bit signed integers.
For further information, refer to torchao: PyTorch Architecture Optimization.
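To make the scheme concrete, the snippet below is a minimal, self-contained sketch of the two halves described above: statically quantizing weights to signed 4-bit integers in groups, and dynamically quantizing activations from their runtime min/max range. It illustrates the idea only and is not the torchao or ExecuTorch implementation; the function names, the 8-bit activation format, and the layer dimensions are assumptions made for the example.

```python
import torch

def quantize_weight_groupwise_4bit(weight: torch.Tensor, group_size: int = 128):
    """Statically quantize a 2-D weight tensor to signed 4-bit integers,
    with one scale per group of `group_size` values along each output channel."""
    out_features, in_features = weight.shape
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    # Symmetric quantization: derive the scale from the max absolute value per group.
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale

def quantize_activations_dynamic(x: torch.Tensor):
    """Dynamically quantize activations per token at runtime, computing the
    quantization parameters from the observed min/max range (8-bit, asymmetric)."""
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-6) / 255.0
    zero_point = torch.round(-x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, 255).to(torch.uint8)
    return q, scale, zero_point

# Example: quantize one linear layer's weight and one batch of activations.
linear = torch.nn.Linear(4096, 4096, bias=False)
q_w, w_scale = quantize_weight_groupwise_4bit(linear.weight.detach(), group_size=128)
x = torch.randn(1, 16, 4096)  # (batch, tokens, hidden dimension)
q_x, x_scale, x_zero = quantize_activations_dynamic(x)
print(q_w.shape, w_scale.shape, q_x.shape)
```

In practice, libraries such as torchao fold these steps into the model's linear layers so that the quantized weights and runtime activation quantization are applied automatically during inference.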
The table below shows WikiText perplexity evaluated with LM Eval. The results are for two different group sizes, with max_seq_len 2048 and 1000 samples (a sketch of the evaluation call follows the table):
| Model | Baseline (FP32) | Groupwise 4-bit (group size 128) | Groupwise 4-bit (group size 256) |
|---|---|---|---|
| Llama 2 7B | 9.2 | 10.2 | 10.7 |
| Llama 3 8B | 7.9 | 9.4 | 9.7 |
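As a hedged sketch of the kind of evaluation behind these numbers, the snippet below runs the WikiText perplexity task through the lm-evaluation-harness Python API against a Hugging Face checkpoint. The model identifier, the sample limit, and the result keys are illustrative assumptions and may differ by harness version; the quantized on-device models in this Learning Path are evaluated through their own export and evaluation flow.

```python
# Sketch: WikiText perplexity with the lm-evaluation-harness Python API.
# Assumptions: lm-eval 0.4-style API, a Hugging Face Llama checkpoint,
# and 1000 evaluation samples as in the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",  # baseline (FP32) checkpoint
    tasks=["wikitext"],                                # WikiText perplexity task
    limit=1000,                                        # number of samples
)
print(results["results"]["wikitext"])                  # includes word perplexity
```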
Note that group sizes smaller than 128 were not enabled, because the resulting models were still too large. This is because current efforts have focused on enabling FP32, and support for FP16 is under way.
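The reason smaller group sizes increase model size is that each group carries its own quantization scale, so more groups mean more scale parameters. The following back-of-the-envelope estimate illustrates this; the parameter count, the FP32-per-group scale format, and the omission of embeddings and other overhead are assumptions for illustration, not exact figures from this Learning Path.

```python
# Rough estimate of quantized weight storage for a ~7B-parameter model.
# Assumptions: 4-bit packed weights, one FP32 scale per group, everything
# else (embeddings, non-linear layers, runtime buffers) ignored.
GiB = 2**30

def approx_size_bytes(params: float, group_size: int, scale_bytes: int = 4) -> float:
    weights = params * 4 / 8                     # 4-bit weights, packed two per byte
    scales = (params / group_size) * scale_bytes # one scale per group
    return weights + scales

for group_size in (32, 128, 256):
    size = approx_size_bytes(7e9, group_size)
    print(f"group size {group_size:>3}: ~{size / GiB:.2f} GiB")
# Smaller groups require more scales, so the model grows as the group size shrinks.
```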
What this implies for model size is: