In this Learning Path, we use the terms hardware counter and event counter interchangeably.
Software events are generated by the Linux kernel or user software. Examples of software events that can be measured are context switches and system calls. Hardware events are generated by the CPU or other system hardware. Examples of hardware events are instructions executed and CPU clock cycles. This Learning Path focuses on hardware events.
Arm hardware events are managed by the Performance Monitoring Unit (PMU). This unit contains the system registers that configure event counting, and it is also where counter results are stored. The number of hardware events that can be counted at the same time is limited. Arm CPUs typically support 4-8 counters. The number of supported hardware events can be found in the Technical Reference Manual (TRM) of each CPU. There is also a dedicated counter for CPU clock cycles, which does not occupy any of the 4-8 event slots. Finally, the PMU supports software increment counters, which can be used to count things such as accesses to a specific data structure.
If you need to count more hardware events than there are available counters, you can multiplex different events over a measurement period. For example, if the CPU supports 6 counters and you want to count 12 different events, you can alternate between two sets of 6 events over the measurement period. However, because each event is only counted for part of the time, its result needs to be extrapolated over the total measurement period. When multiplexing is used, the final scaled counter results should be taken as an estimate of the total events counted. This may be acceptable in many cases, but if your debug and analysis work is done methodically, you can usually narrow down the number of events to a number that doesn’t require multiplexing. Avoiding multiplexing is preferable because it keeps the counter results more accurate.
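The extrapolation described above can be sketched in a few lines of C. This is a minimal illustration, not code from this Learning Path: the function name `scale_multiplexed_count` and its parameters are hypothetical. It assumes you know how long the event was requested (`time_enabled`) and how long it actually occupied a hardware counter (`time_running`), which is the same ratio the Linux perf_events interface exposes for scaling multiplexed counts.

```c
#include <stdint.h>

/*
 * Extrapolate a multiplexed counter value to the full measurement period.
 * (Hypothetical helper for illustration.)
 *
 * raw_count:    events counted while the event occupied a hardware counter
 * time_enabled: total time the event was requested, in nanoseconds
 * time_running: time the event was actually scheduled on a counter, in nanoseconds
 *
 * Returns an estimate of the events over the whole period, or 0.0 if the
 * event was never scheduled and no estimate is possible.
 */
double scale_multiplexed_count(uint64_t raw_count,
                               uint64_t time_enabled,
                               uint64_t time_running)
{
    if (time_running == 0) {
        return 0.0;
    }
    /* Scale by the fraction of the period that was actually observed. */
    return (double)raw_count * ((double)time_enabled / (double)time_running);
}
```

With 12 events sharing 6 counters, each event is scheduled for roughly half of the period, so its raw count is roughly doubled. The result is only an estimate because it assumes the event rate during the unobserved time matched the observed time.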
All available hardware events and their unique event numbers are found in the Technical Reference Manual of a CPU. For example, if you are interested in the hardware events supported by the Neoverse N2, review the Neoverse N2 TRM.
It’s helpful to have a basic understanding of Arm exception levels because they impact counter setup. The Arm Architecture A-profile reference manual defines 4 exception levels. These are called EL0 (required), EL1 (required), EL2 (optional), and EL3 (optional). For Neoverse cores, all 4 levels are implemented because Neoverse-based platforms usually need to support virtualization. The easiest way to think of these levels is through the lens of execution privilege. User space code executes in EL0, kernel code executes in EL1, hypervisor code executes in EL2, and firmware executes in EL3. Arm CPUs enforce this execution privilege at the hardware level. For example, by default, EL0 (user) code cannot access the PMU configuration registers. For EL0 access to work, EL1 (kernel) code needs to enable PMU access for EL0 (user) code. Once this happens, user programs are allowed to configure and read counters. Most methods for PMU access take care of this for you. However, it’s good to have this understanding in case you decide to implement custom/assembly code for counter access.
Before you instrument counters, you should consider using tools which do not require you to write code in order to access counters. These tools are discussed below.
Linux Perf is part of the Linux source code (under tools/perf). It is capable of measuring software and hardware events, at either the process or the system level. Depending on what you are working on, Perf can save you the need to instrument counters directly in your code. Refer to the Perf on Arm Linux install guide to learn how to install Perf. There is also a walkthrough of Perf and its features published by Brendan Gregg.
Arm publishes Telemetry Solution, a tool that does not require code to be written. In fact, it uses Linux Perf. This tool is accompanied by a general performance analysis methodology that allows you to separate performance bottlenecks between the front-end and the back-end of the CPU. Using this methodology, you can measure things like branch effectiveness, cache effectiveness, and instruction mix. This tool will continue to grow in capabilities over time. It’s strongly recommended to try this tool before instrumenting your code.
If you decide to add counter instrumentation to your source code, various methods allow this from user space. The method you use should be determined by a combination of preference and whatever limitations you may have in your environment.
If all you need to do is measure elapsed time, you can use a system timer instead of the PMU. This requires the least amount of code and is the quickest way to get started.
This Learning Path contains an example of using a system counter.
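As a rough illustration of timing a region of code without the PMU, the sketch below uses the POSIX `clock_gettime` call with `CLOCK_MONOTONIC`. This is an assumption made for portability and is not necessarily the approach taken in this Learning Path's own system counter example. The helper names `now_ns`, `time_region_ns`, and `example_work` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Return the current monotonic time in nanoseconds, or 0 on failure. */
uint64_t now_ns(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
        return 0;
    }
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Measure how long work() takes, in nanoseconds. */
uint64_t time_region_ns(void (*work)(void))
{
    uint64_t start = now_ns();
    work();
    uint64_t end = now_ns();
    return end - start;
}

/* Hypothetical workload standing in for the code you want to time. */
void example_work(void)
{
    volatile uint64_t sum = 0;
    for (uint64_t i = 0; i < 1000000; i++) {
        sum += i;
    }
}
```

Calling `time_region_ns(example_work)` returns the elapsed time for one run of the workload. A monotonic clock is used so that wall-clock adjustments during the measurement do not distort the result.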
The Performance Application Programming Interface (PAPI) is a tool for instrumenting hardware and software events in your code. It supports both C/C++ and Fortran. PAPI relies on a library called libpfm4 which uses the Linux perf_events infrastructure to configure and count events. If your platform is not listed as supported by libpfm4, it doesn’t mean PAPI won’t work. It is worth trying PAPI even if you do not see your specific Arm CPU implementation listed as supported. Another advantage of PAPI is that it is capable of managing event multiplexing for you.
This Learning Path contains a PAPI based instrumentation example.
The Linux perf_events infrastructure is another way hardware and software events can be counted. In fact, libpfm4 and Linux Perf both use this infrastructure. The perf_event_open system call can be used to instrument counters in your code. However, if multiplexing of events is required, you will need to implement that yourself. The documentation for this interface isn’t as thorough as PAPI’s, and using it may require some trial and error.
This Learning Path contains a perf_event_open based instrumentation example.
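To give a feel for the interface, here is a minimal sketch that counts user-space instructions executed by a region of code on the calling thread. It assumes a Linux system with the kernel headers installed and follows the general pattern shown in the perf_event_open man page. It is not this Learning Path's own example, and the names `count_instructions` and `spin_loop` are hypothetical. Opening the counter can fail, for instance when the perf_event_paranoid setting restricts access or when no PMU is exposed inside a virtual machine, so the sketch returns -1 in that case.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no wrapper for perf_event_open, so invoke it through syscall(). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/*
 * Count user-space instructions executed while work() runs on this thread.
 * Returns the count, or -1 if the counter could not be opened or read.
 */
long long count_instructions(void (*work)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* count EL0 (user) execution only */
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: measure the calling thread on any CPU. */
    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) {
        return -1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof(count));
    close(fd);
    return n == (ssize_t)sizeof(count) ? (long long)count : -1;
}

/* Hypothetical workload standing in for the code you want to measure. */
void spin_loop(void)
{
    volatile uint64_t sum = 0;
    for (uint64_t i = 0; i < 100000; i++) {
        sum += i;
    }
}
```

Calling `count_instructions(spin_loop)` returns the instruction count for one run of the workload. The sketch opens a single event, so no multiplexing is involved. Counting more events than the CPU has counters would require the scheduling and scaling logic mentioned above.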
The Linux kernel contains a tool called eBPF (extended Berkeley Packet Filter) that can be used for event counting. This tool is complex and the above methods should be considered before trying eBPF. In fact, eBPF should only be used for counting if you are already using eBPF as part of a broader performance investigation. For this reason, this Learning Path does not contain an example of how to use eBPF for event counting.
The easiest way to instrument non-C/C++ programs is to write a C library and call it from your non-C/C++ program. For example, in Java, it is possible to use the Java Native Interface (JNI) to call C/C++ functions. There may also be tools in other environments that enable access to performance counters. A future revision of this Learning Path may include non-C/C++ examples.
You can enable and configure hardware counters using assembly code. Counting events this way requires knowledge of the specific PMU registers needed to enable and configure the counters of interest. It also requires that you implement multiplexing yourself if you need to count more events than there are available CPU counters. This method of counter access is not covered in this Learning Path because the other methods outlined here are easier.