Run a Large Language Model (LLM) chatbot with PyTorch using KleidiAI on Arm servers

What you've learned

You should now know how to:

  • Download the Meta Llama 3.1 model from the Meta Hugging Face repository.
  • Quantize the model to 4 bits using the optimized INT4 KleidiAI kernels for PyTorch.
  • Run LLM inference using PyTorch on an Arm-based CPU.
  • Expose LLM inference as a browser application, with Streamlit as the frontend and the PyTorch Torchchat framework as the LLM backend server.
  • Measure performance metrics of the LLM inference running on an Arm-based CPU.
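The last step above, measuring inference performance, can be sketched in plain Python. The snippet below is illustrative only: `measure_generation_metrics` and `fake_model` are hypothetical names, and the stand-in generator merely simulates a model streaming tokens. The metrics it reports (time-to-first-token and decode tokens per second) are the same ones typically collected when benchmarking an LLM on an Arm-based CPU.

```python
import time

def measure_generation_metrics(generate_tokens, prompt):
    """Time a token-generating iterable and report common LLM latency metrics."""
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in generate_tokens(prompt):
        if first_token_time is None:
            # Latency until the first token appears (prefill time)
            first_token_time = time.perf_counter() - start
        n_tokens += 1
    total = time.perf_counter() - start
    # Decode throughput is measured after the first token
    decode_time = total - (first_token_time or 0.0)
    return {
        "tokens": n_tokens,
        "time_to_first_token_s": first_token_time,
        "tokens_per_second": (n_tokens - 1) / decode_time
        if n_tokens > 1 and decode_time > 0
        else float("nan"),
    }

def fake_model(prompt):
    """Stand-in generator simulating a model emitting one token per word."""
    for tok in prompt.split():
        time.sleep(0.01)
        yield tok

metrics = measure_generation_metrics(fake_model, "hello from an arm based cpu")
print(metrics["tokens"])  # 6
```

The same wrapper could be pointed at a real token stream, for example one produced by a Torchchat generation loop, to collect comparable numbers.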

Knowledge Check

Can you run PyTorch on Arm CPUs?

Which quantization scheme was used for the LLM?

What framework was used to create the front-end application?

