Yes. Advances in the generative AI space, such as the GGUF model format and smaller-parameter models, have made LLM inference on CPUs very efficient.
Yes. By default, llama.cpp is built for CPU-only execution on Linux and Windows.
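For reference, a minimal build sketch is shown below. It assumes a CMake-based checkout of llama.cpp; exact flags and target names may differ between versions.

```bash
# Clone and build llama.cpp with the default configuration.
# No GPU backend flags (e.g. CUDA or Metal) are passed, so the
# resulting binaries run inference entirely on the CPU.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```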
Can you profile the time taken by the model to generate the output, up to the end-of-text token?
llama.cpp prints several timing metrics at the end of each run. One of these is the eval time, which is the time the model took to generate the output tokens.
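As an illustration, the timing summary looks roughly like the excerpt below; the exact labels and layout vary between llama.cpp versions, and the numbers here are placeholders rather than measured values. The eval time line covers the token-generation phase.

```
llama_print_timings: prompt eval time =  250.00 ms /  16 tokens
llama_print_timings:        eval time = 5000.00 ms / 128 runs
llama_print_timings:       total time = 5300.00 ms
```

Dividing the eval time by the number of generation runs gives an average per-token latency for the output phase.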