Build an Android chat app with Llama, ExecuTorch, and XNNPACK: Review

What you've learned

You should now know how to:

Set up an ExecuTorch development environment.
Describe how ExecuTorch uses XNNPACK kernels to accelerate inference on Arm-based platforms.
Describe how 4-bit groupwise post-training quantization (PTQ) reduces model size without significantly sacrificing model accuracy.
Build and run Llama models using ExecuTorch on your development machine.
Build and run an Android chat app with different Llama models using ExecuTorch on an Arm-based smartphone.
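
To make the quantization objective concrete, here is a minimal sketch of symmetric 4-bit groupwise weight quantization in plain Python. It is an illustration only, not ExecuTorch's implementation: the function names, the group size of 32, and the 16-bit scale width are all assumptions chosen for the example.

```python
import math


def quantize_groupwise_4bit(weights, group_size=32):
    """Symmetric 4-bit quantization with one scale per group (illustrative)."""
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    qvals, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7; signed 4-bit range is [-8, 7].
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        qvals.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qvals, scales


def dequantize_groupwise(qvals, scales, group_size=32):
    """Recover approximate weights by multiplying each value by its group scale."""
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]


def storage_bytes(num_weights, group_size=32, scale_bits=16):
    """Packed size: 4 bits per weight plus one scale per group (assumed 16-bit)."""
    return (num_weights * 4 + (num_weights // group_size) * scale_bits) / 8


# Hypothetical weights, used only to exercise the functions.
weights = [math.sin(i) * 0.1 for i in range(128)]
qvals, scales = quantize_groupwise_4bit(weights)
restored = dequantize_groupwise(qvals, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(storage_bytes(128))  # 72.0 bytes, versus 128 * 4 = 512 bytes in float32
print(max_err <= max(scales) / 2 + 1e-12)  # rounding error is at most half a scale step
```

Because each group carries its own scale, an outlier weight only coarsens the resolution of its own group rather than the whole tensor, which is why accuracy tends to hold up better than with a single per-tensor scale. Smaller groups improve accuracy further at the cost of storing more scales.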

What is ExecuTorch?

ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices.

It is a PyTorch method for quantizing LLMs.

It is a program for executing PyTorch models.

What is Llama?

A domesticated South American camelid.

A proprietary large language model.

Llama is a family of large language models trained on publicly available data.

Which quantization scheme did you use for the Android app?

8-bit groupwise, per-token dynamic quantization of all the linear layers.

4-bit groupwise, per-token dynamic quantization of all the linear layers.

16-bit groupwise, per-token dynamic quantization of all the linear layers.

No quantization.
