In this section, you’ll learn the memory hierarchy concepts the worked example builds on. It’s not an exhaustive explanation, but it covers what you’ll need to interpret the profiling results.
Modern Arm Neoverse server CPUs use a hierarchy of memories to reduce the cost of loading and storing data. The fastest storage sits close to each CPU core, while larger memories sit farther away and take more cycles to access.
You usually see the following:
L1 data cache (L1d) and L1 instruction cache (L1i) close to each core, with each access usually taking up to 10 cycles.
To inspect cache topology on an Arm Neoverse server, see the Learning Path for Arm’s Sysreport tool or use the lscpu command. Unlike lscpu, Sysreport also reports the set associativity for each cache level. For example, you can run the following commands on a system with Git and Python 3 installed:
git clone https://github.com/ArmDeveloperEcosystem/sysreport.git
cd sysreport
python3 src/sysreport.py | grep -i cache -A 4
Depending on your system, the output is similar to:
cache info: size, associativity, sharing
cache line size: 64
Caches:
64 x L1D 64K 4-way 64b-line
64 x L1I 64K 4-way 64b-line
64 x L2U 1M 8-way 64b-line
1 x L3U 32M 16-way 64b-line
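The size, associativity, and line size that Sysreport prints are enough to work out how many sets each cache has. The following is a minimal C++ sketch of that arithmetic and is not part of Sysreport. The function names `cache_sets` and `set_index` are hypothetical, and the modulo set indexing is a simplifying assumption, because real implementations may index or hash addresses differently.

```cpp
#include <cassert>
#include <cstddef>

// Number of sets in a set-associative cache:
//   sets = cache size / (ways * line size)
inline std::size_t cache_sets(std::size_t size_bytes,
                              std::size_t ways,
                              std::size_t line_bytes) {
    assert(ways > 0 && line_bytes > 0);
    return size_bytes / (ways * line_bytes);
}

// Set that an address maps to under simple modulo indexing.
// ASSUMPTION: real hardware may use a different index function.
inline std::size_t set_index(std::size_t address,
                             std::size_t line_bytes,
                             std::size_t sets) {
    assert(line_bytes > 0 && sets > 0);
    return (address / line_bytes) % sets;
}
```

With the example output above, `cache_sets(64 * 1024, 4, 64)` gives 256 sets for the L1D cache. Under this simplified model, addresses that are 16 KB apart map to the same L1D set, so more than four such addresses in a hot loop would compete for the four ways of that set.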
For a more visual view, install hwloc and generate a topology image:
sudo apt update
sudo apt install -y hwloc
hwloc-ls --of png > topology.png
Figure: Example hardware locality topology
The diagram shows cache tiers on an AWS Graviton3 bare metal server based on Neoverse V1. Each of the 64 cores has private L1d, L1i, and L2 caches, and all cores share one L3 cache, sometimes referred to as last-level cache (LLC). Cache sizes, especially at later levels, are not fixed by the Neoverse architecture. Implementers such as AWS or Google can configure larger or smaller caches based on design goals.
Non-uniform memory access (NUMA) means memory latency can depend on which processor or socket owns the memory being accessed. On this AWS Graviton3 instance, there is only one NUMA node.
To get a comprehensive system-level understanding of the memory subsystem, see the Learning Path on the Arm system characterization tool.
Applications use virtual addresses, which are the addresses a program sees instead of physical DRAM locations. With virtual addressing, the operating system isolates processes, protects memory, and maps each program’s address space to available physical memory. The processor translates virtual addresses to physical addresses before it accesses memory.
The translation lookaside buffer (TLB) caches recent virtual-to-physical translations at page granularity to avoid page table walks. A TLB miss occurs when the needed translation is not cached, so the processor performs a page table walk to find the mapping. Page table walks add latency before a load or store can complete. Large working sets and irregular access patterns, such as strides larger than the typical 4 KB page size, can increase TLB pressure because the program touches many pages with little reuse.
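To make the effect of stride on TLB pressure concrete, the following C++ sketch counts how many distinct pages a strided access pattern touches. It is an illustration under stated assumptions: the buffer starts on a page boundary, the function names `system_page_size` and `pages_touched` are hypothetical, and the 4096-byte fallback applies only if the operating system does not report a page size.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>  // sysconf (POSIX)

// Page size reported by the operating system, in bytes.
inline std::size_t system_page_size() {
    const long size = sysconf(_SC_PAGESIZE);
    // ASSUMPTION: fall back to 4096 bytes if the query fails.
    return size > 0 ? static_cast<std::size_t>(size) : 4096;
}

// Count the distinct pages touched by `accesses` loads that start at
// offset 0 of a page-aligned buffer and advance by `stride_bytes`.
inline std::size_t pages_touched(std::size_t accesses,
                                 std::size_t stride_bytes,
                                 std::size_t page_bytes) {
    assert(page_bytes > 0);
    std::size_t count = 0;
    std::size_t previous_page = 0;
    for (std::size_t i = 0; i < accesses; ++i) {
        const std::size_t page = (i * stride_bytes) / page_bytes;
        // Offsets only increase, so a new page number is a new page.
        if (i == 0 || page != previous_page) {
            ++count;
            previous_page = page;
        }
    }
    return count;
}
```

For 1024 accesses and 4096-byte pages, an 8-byte stride touches 2 pages, while a 4096-byte stride touches 1024 pages. Each page needs its own translation, so the second pattern is far more likely to exceed the number of entries the TLB can hold.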
A page fault occurs when the processor can’t complete an access because the page tables hold no usable mapping for the virtual page, so the kernel must handle the access. A minor page fault is usually harmless: the data is already in RAM, and the kernel only creates the mapping. This fault commonly happens with anonymous memory, when Linux lazily backs newly allocated heap or stack memory on first touch. A major page fault is more expensive because the kernel must fetch the page from disk, such as from a file or swap, so repeated major faults are a real performance concern.
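One way to observe first-touch minor faults from inside a process is to read the fault counters that `getrusage` exposes before and after touching freshly mapped memory. The following C++ sketch assumes Linux; the `FaultCounts` type and both function names are hypothetical helpers written for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>      // mmap, munmap
#include <sys/resource.h>  // getrusage
#include <unistd.h>        // sysconf

struct FaultCounts {
    long minor;  // faults served without disk I/O
    long major;  // faults that required disk I/O
};

// Fault counters for the calling process so far.
inline FaultCounts current_fault_counts() {
    struct rusage usage {};
    getrusage(RUSAGE_SELF, &usage);
    return {usage.ru_minflt, usage.ru_majflt};
}

// Map `bytes` of anonymous memory, write one byte per page, and return
// the number of minor faults taken while writing. Returns -1 on failure.
inline long minor_faults_on_first_touch(std::size_t bytes) {
    assert(bytes > 0);
    const long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;
    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;

    volatile char* p = static_cast<volatile char*>(mem);
    const FaultCounts before = current_fault_counts();
    for (std::size_t off = 0; off < bytes;
         off += static_cast<std::size_t>(page)) {
        p[off] = 1;  // first write to each page
    }
    const FaultCounts after = current_fault_counts();

    munmap(mem, bytes);
    return after.minor - before.minor;
}
```

The returned count depends on the system. Kernel features such as transparent huge pages can back several base pages with one fault, so treat the number as an indication rather than an exact page count.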
The working set is the data your program actively touches during a period of execution. It differs from resident set size (RSS), which is the amount of physical memory currently resident for a process. A process can have a large RSS while the hot loop actively uses only a smaller working set.
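To compare the two quantities, you can read RSS from the `VmRSS` line of `/proc/self/status` on Linux and set it against an estimate of the bytes a hot loop touches. The following C++ sketch shows one way to do that. The function names are hypothetical, and the working set value is something you estimate yourself, for example as element count times element size.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Parse the VmRSS value from text in /proc/<pid>/status format.
// The kernel labels the unit "kB". Returns -1 if no value is found.
inline long parse_vmrss_kb(std::istream& status) {
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {  // line starts with the key
            std::istringstream fields(line.substr(6));
            long kb = -1;
            if (fields >> kb) return kb;
            return -1;
        }
    }
    return -1;
}

// RSS of the calling process, or -1 if /proc is unavailable.
inline long resident_set_kb() {
    std::ifstream status("/proc/self/status");
    return parse_vmrss_kb(status);
}

// Fraction of RSS that an estimated working set accounts for.
inline double working_set_share_of_rss(std::size_t working_set_bytes,
                                       long rss_kb) {
    assert(rss_kb > 0);
    return static_cast<double>(working_set_bytes) /
           (static_cast<double>(rss_kb) * 1024.0);
}
```

A small share means most resident memory is not part of the hot loop, so cache and TLB behavior depends on the working set rather than on RSS.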
From a programmer’s perspective, much of the cache and memory subsystem is a black box defined by processor architecture and implementation. Features such as cache associativity, prefetching, and translation caching are designed to hide latency across many workloads. Your main software levers are data structure layout, allocation patterns, and choices such as page size. The layout of your C++ data structures can determine whether the memory hierarchy helps or hurts performance. The compiler generally can’t reorder structure fields or split objects automatically because that would change program semantics.
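As an illustration of the layout lever, the following C++ sketch contrasts an array-of-structures layout with a structure-of-arrays layout for a loop that reads only one field. The types and function names are hypothetical and are not taken from the example application in this Learning Path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: the hot field `x` shares each element with
// fields that the summing loop never reads.
struct ParticleAoS {
    double x;
    double y;
    double z;
    char label[40];  // cold data (hypothetical)
};

// Structure-of-arrays: each field lives in its own contiguous array.
struct ParticlesSoA {
    std::vector<double> x;
    std::vector<double> y;
    std::vector<double> z;
};

inline ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
    ParticlesSoA out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.z.reserve(in.size());
    for (const ParticleAoS& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    return out;
}

// Steps through memory sizeof(ParticleAoS) bytes at a time.
inline double sum_x_aos(const std::vector<ParticleAoS>& particles) {
    double sum = 0.0;
    for (const ParticleAoS& p : particles) sum += p.x;
    return sum;
}

// Steps through memory sizeof(double) bytes at a time.
inline double sum_x_soa(const ParticlesSoA& particles) {
    assert(particles.x.size() == particles.y.size() &&
           particles.y.size() == particles.z.size());
    double sum = 0.0;
    for (double value : particles.x) sum += value;
    return sum;
}
```

On typical 64-bit Linux ABIs, `sizeof(ParticleAoS)` is 64 bytes, so with 64-byte cache lines the first loop fetches about one line per element and uses 8 bytes of it. The second loop reads the same values from densely packed doubles, so each fetched line holds eight useful values. Both functions return the same sum; only the memory traffic differs.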
You’ve now learned about the CPU memory hierarchy, virtual memory and address translation, and the terminology you need to interpret profiling results for the example application in this Learning Path.
Next, you’ll set up and build the example C++ application.