Download and export the Llama 3 8B model

There are multiple model options available to use with ExecuTorch. Here you will focus on Llama 3 8B, but you can select the model that works best for you.

  1. Download the Llama 3 pretrained parameters from Meta’s official llama3 repository.

  2. Clone the Llama 3 Git repository:

                git clone

  3. Navigate to llama-downloads, enter your email address, and accept the license to receive the URL for Llama 3 model downloads.

  4. Download required models:

                cd llama3
                # Enter the URL and desired model

  5. Export model and generate .pte file:

    Run the Python command to export the model:

                python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w  --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"

    Where <consolidated.00.pth> and <params.json> are the paths to the downloaded model files, found in llama3/Meta-Llama-3-8B by default.

    Due to the larger vocabulary size of Llama 3, you should quantize the embeddings with --embedding-quantize 4,32 to further reduce the model size.
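
To see why embedding quantization matters for Llama 3, the following sketch estimates the size of the token embedding table before and after 4-bit quantization. The vocabulary size (128,256) and hidden dimension (4,096) are the published Llama 3 8B values rather than figures from this guide. Reading 4,32 as "4 bits, groups of 32" and storing one 16-bit scale per group are assumptions made for illustration, so the real .pte layout may differ.

```python
# Rough size estimate for the Llama 3 8B token embedding table.
# Assumptions (not stated in this guide): vocab size, hidden dimension,
# and one 2-byte scale per quantization group.
VOCAB_SIZE = 128_256
DIM = 4_096


def embedding_bytes_fp32(vocab: int, dim: int) -> int:
    """Size in bytes when every weight is a 4-byte float."""
    return vocab * dim * 4


def embedding_bytes_quantized(vocab: int, dim: int, bits: int,
                              group_size: int, scale_bytes: int = 2) -> int:
    """Size in bytes for packed low-bit weights plus per-group scales."""
    n_weights = vocab * dim
    packed = n_weights * bits // 8
    scales = (n_weights // group_size) * scale_bytes
    return packed + scales


fp32_size = embedding_bytes_fp32(VOCAB_SIZE, DIM)
q_size = embedding_bytes_quantized(VOCAB_SIZE, DIM, bits=4, group_size=32)

print(f"fp32 embeddings:  {fp32_size / 1e9:.2f} GB")
print(f"4-bit embeddings: {q_size / 1e9:.2f} GB")
print(f"reduction:        {fp32_size / q_size:.1f}x")
```

Under these assumptions the table shrinks from roughly 2.1 GB to about 0.3 GB, which is a meaningful saving on a phone.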

Download and export the stories110M model

Follow the steps in this section if you want to deploy and run a smaller model for educational purposes instead of the full Llama 3 8B model.

From the executorch root directory, follow these steps:

  1. Download the stories110M model checkpoint and tokenizer.model from GitHub.

                wget ""
                wget ""

  2. Create the params file.

                echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json

  3. Export model and generate .pte file.

                python -m examples.models.llama2.export_llama -c <stories110M checkpoint> -p params.json -X

  4. Create tokenizer.bin.

                python -m examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
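
As a sanity check on the params file from step 2, the sketch below estimates how many parameters those values imply. It assumes the standard Llama 2 layer layout (four attention projections, a three-matrix feed-forward block whose hidden size is two thirds of 4 x dim rounded up to multiple_of, and two norms per layer) and that the output layer shares its weights with the token embedding. Neither assumption is stated in this guide, and the function names are illustrative.

```python
import json

# Same values as the params.json created in step 2.
params = json.loads(
    '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, '
    '"norm_eps": 1e-05, "vocab_size": 32000}'
)


def ffn_hidden_dim(dim: int, multiple_of: int) -> int:
    """Feed-forward hidden size, rounded up to a multiple of multiple_of."""
    hidden = int(2 * (4 * dim) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def estimate_param_count(p: dict) -> int:
    """Parameter count under the assumed Llama 2 layout with tied embeddings."""
    dim = p["dim"]
    hidden = ffn_hidden_dim(dim, p["multiple_of"])
    attention = 4 * dim * dim        # query, key, value, output projections
    feed_forward = 3 * dim * hidden  # three feed-forward matrices
    norms = 2 * dim                  # two norm weight vectors per layer
    per_layer = attention + feed_forward + norms
    embedding = p["vocab_size"] * dim
    final_norm = dim
    return embedding + p["n_layers"] * per_layer + final_norm


head_dim = params["dim"] // params["n_heads"]
total = estimate_param_count(params)

print(f"head_dim = {head_dim}")
print(f"estimated parameters = {total:,}")
```

Under these assumptions the estimate comes to roughly 110 million parameters, which is consistent with the model's name.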

Optional: Evaluate Llama 3 model accuracy

You can evaluate model accuracy using the same checkpoint and params files that you passed to the export command:

            python -m examples.models.llama2.eval_llama -c <consolidated.00.pth> -p <params.json> -t <tokenizer.model> -d fp32 --max_seq_len 2048 --limit 1000

Model evaluation without a GPU will take a long time. On a MacBook with an M3 chip and 18GB RAM this took 10+ hours.

Validate models on the development machine

Before running models on a smartphone, you can validate them on your development computer.

Follow the steps below to build ExecuTorch and the Llama runner to run models.

  1. Build executorch with optimized CPU performance:

                cmake -DPYTHON_EXECUTABLE=python \
                    -DCMAKE_INSTALL_PREFIX=cmake-out \
                    -DCMAKE_BUILD_TYPE=Release \
                    -Bcmake-out .
                cmake --build cmake-out -j16 --target install --config Release

    The CMake build options are available on GitHub.

  2. Build the Llama runner:

    For Llama 3, add the -DEXECUTORCH_USE_TIKTOKEN=ON option.

    If you are building on a Mac, there is currently an open bug that adds a --gc-sections flag to the ld options. Remove this flag by opening examples/models/llama2/CMakeLists.txt and deleting these lines:

                if(CMAKE_BUILD_TYPE STREQUAL "Release")
                  target_link_options(llama_main PRIVATE "LINKER:--gc-sections,-s")

    Run cmake:

                cmake -DPYTHON_EXECUTABLE=python \
                    -DCMAKE_INSTALL_PREFIX=cmake-out \
                    -DCMAKE_BUILD_TYPE=Release \
                    -Bcmake-out/examples/models/llama2 \
                    examples/models/llama2
                cmake --build cmake-out/examples/models/llama2 -j16 --config Release

  3. Run the model:

                cmake-out/examples/models/llama2/llama_main --model_path=<model pte file> --tokenizer_path=<tokenizer.bin> --prompt=<prompt>

    The run options are available on GitHub.

    For Llama 3, you can pass the original tokenizer.model without converting it to a .bin file.
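
If you want to script the run step, the sketch below assembles the llama_main command line from Python, so that prompts containing spaces or quotes reach the runner unchanged. The helper names build_run_command and run_model and the example file names are hypothetical. Only the three flags shown in the run step are used.

```python
import shlex
import subprocess
from pathlib import Path

# Runner location produced by the build step in this guide.
RUNNER = "cmake-out/examples/models/llama2/llama_main"


def build_run_command(model_path: str, tokenizer_path: str, prompt: str,
                      runner: str = RUNNER) -> list:
    """Return the argument list for llama_main (hypothetical helper)."""
    return [
        runner,
        f"--model_path={model_path}",
        f"--tokenizer_path={tokenizer_path}",
        f"--prompt={prompt}",
    ]


def run_model(model_path: str, tokenizer_path: str, prompt: str) -> int:
    """Run llama_main and return its exit code; raises if it is not built."""
    cmd = build_run_command(model_path, tokenizer_path, prompt)
    if not Path(cmd[0]).is_file():
        raise FileNotFoundError(f"runner not found: {cmd[0]} (build it first)")
    # Passing a list avoids shell quoting problems with the prompt.
    return subprocess.run(cmd, check=True).returncode


if __name__ == "__main__":
    # "model.pte" and "tokenizer.bin" are placeholder file names.
    cmd = build_run_command("model.pte", "tokenizer.bin", "Once upon a time")
    print(shlex.join(cmd))  # shell-safe form of the same command
```

Printing the command with shlex.join gives a line you can paste into a terminal, while run_model executes it directly from the executorch root directory.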