Converting FP32 tensors to INT4 cuts the number of bits per value from 32 to 4, an 8x reduction, which shrinks the memory footprint accordingly.
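A minimal sketch of this idea, assuming a simple symmetric per-tensor scheme (the helper `quantize_int4` is illustrative, not a library API; real kernels pack two 4-bit values per byte, which is only mimicked here in the size arithmetic):

```python
import numpy as np

def quantize_int4(x: np.ndarray):
    # Symmetric per-tensor quantization to the signed INT4 range [-8, 7].
    # Values are held in int8 for simplicity; storage math below assumes packing.
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

x = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int4(x)

fp32_bytes = x.size * 4    # 32 bits per value
int4_bytes = x.size // 2   # 4 bits per value, two values per byte
print(f"FP32: {fp32_bytes / 2**20:.1f} MiB, INT4: {int4_bytes / 2**20:.1f} MiB")
# Dequantization recovers an approximation of the original: x ≈ q * scale
```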
What quantization scheme does Llama require to run on an embedded device such as the Raspberry Pi 5?
A 4-bit quantization scheme: it yields the smallest memory footprint for Llama 3 and is what lets the model fit in the limited RAM of a device like the Raspberry Pi 5.
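A back-of-the-envelope check, assuming the 8B-parameter Llama 3 variant and an 8 GB Raspberry Pi 5 (weights only; the KV cache and activations add further overhead):

```python
# Approximate weight storage for an 8-billion-parameter model
params = 8e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name}: {gib:.1f} GiB")
# FP32: ~29.8 GiB, FP16: ~14.9 GiB, INT8: ~7.5 GiB, INT4: ~3.7 GiB
# Only the 4-bit variant leaves comfortable headroom within 8 GB of RAM.
```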
Dynamic quantization refers to quantizing activations at runtime: the weights are quantized ahead of time, while the scale factors for the activations are computed on the fly during inference.
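A minimal sketch using PyTorch's built-in dynamic quantization, which targets layers such as `nn.Linear` and uses int8 rather than int4 (the toy model here is only for illustration):

```python
import torch
import torch.nn as nn

# Small example model; only the nn.Linear layers will be quantized.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Weights are converted to int8 ahead of time; activation scales are
# determined dynamically at inference time from the observed ranges.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(qmodel(x).shape)  # torch.Size([1, 10])
```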