Who is this for?
This is an introductory topic for software developers interested in running LLMs using PyTorch on Arm-based servers.
What will you learn?
Upon completion of this learning path, you will be able to:
- Download the Meta Llama 3.1 model from Meta's Hugging Face repository.
- Quantize the model to 4-bit integer (INT4) format using optimized KleidiAI kernels for PyTorch.
- Run LLM inference using PyTorch on an Arm-based CPU.
- Expose LLM inference as a browser application, with Streamlit as the frontend and the PyTorch Torchchat framework as the LLM backend server.
- Measure performance metrics of the LLM inference running on an Arm-based CPU (see the measurement sketch after this list).
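To give a flavor of the performance measurement involved, here is a minimal sketch of timing token generation on the CPU. It uses the Hugging Face `transformers` API as a stand-in for the Torchchat backend covered later in the learning path; the model ID and token counts are illustrative assumptions, and access to the gated meta-llama repository must already be granted.

```python
# Minimal sketch: measure generated tokens per second for CPU inference.
# Assumes `torch` and `transformers` are installed; the model ID is an
# illustrative assumption (the learning path itself uses Torchchat).
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Explain KleidiAI in one sentence.", return_tensors="pt")

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=64)
elapsed = time.perf_counter() - start

# Count only the newly generated tokens, excluding the prompt.
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Generated {new_tokens} tokens in {elapsed:.2f}s "
      f"({new_tokens / elapsed:.2f} tokens/s)")
```

Tokens per second is the headline metric for LLM serving on CPUs; the same timing approach applies whichever backend generates the tokens.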
Prerequisites
Before starting, you will need the following:
- An Arm-based instance with at least 16 CPUs from a cloud service provider, or an on-premise Arm server.
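If you want to confirm that your instance meets this requirement before starting, a quick check like the following sketch can help (the 16-CPU threshold mirrors the prerequisite above):

```python
# Sketch: verify the machine is Arm-based (aarch64) and has at least 16 CPUs.
import os
import platform

arch = platform.machine()  # reports "aarch64" on Arm-based Linux servers
cpus = os.cpu_count() or 0

print(f"Architecture: {arch}, CPUs: {cpus}")
assert arch == "aarch64", "This learning path targets Arm-based (aarch64) servers."
assert cpus >= 16, "At least 16 CPUs are recommended for this learning path."
```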