Tune network workloads on Arm-based bare-metal instances: Establish baseline performance


Overview

In this section, you establish a baseline configuration before applying advanced techniques to tune the performance of Tomcat-based network workloads on an Arm Neoverse bare-metal instance.

Note

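To avoid running out of file descriptors under load, raise the file-descriptor limit on both the server and the client. The setting applies only to the current shell session, so run it in the same shell that starts Tomcat or wrk2: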

    

        
        
    ulimit -n 65535
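
As a minimal sketch, a check like the following could be run before each benchmark to confirm the limit is in place. The helper name `needs_raise` is hypothetical, and the 65535 target is taken from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the current limit is below the target.
# A limit of "unlimited" is treated as already sufficient.
needs_raise() {
  local current="$1" target="$2"
  if [ "$current" = "unlimited" ]; then
    return 1
  fi
  [ "$current" -lt "$target" ]
}

target=65535
current="$(ulimit -n)"   # limit of the current shell
if needs_raise "$current" "$target"; then
  echo "Limit is ${current}; run 'ulimit -n ${target}' in this shell first."
else
  echo "Limit is ${current}; no change needed."
fi
```

The check only reports the state of the shell it runs in, so it is most useful in the script or terminal that launches Tomcat or wrk2.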

Configure an optimal baseline before tuning

This baseline includes:

Aligning IOMMU settings with Ubuntu defaults
Setting a default CPU configuration
Disabling access logging
Setting optimal thread counts

Align IOMMU settings with Ubuntu defaults

Note

If you are using a cloud image (for example, AWS) with non-default kernel parameters, align IOMMU settings with the Ubuntu defaults: iommu.strict=1 and iommu.passthrough=0.

Open the GRUB defaults file to add or update GRUB_CMDLINE_LINUX:

    

        
        
    sudo vi /etc/default/grub

Add the line, or update the existing one, so that it includes both parameters. Keep any other parameters that are already set:

    

        
        
    GRUB_CMDLINE_LINUX="iommu.strict=1 iommu.passthrough=0"
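
If the line already carries other parameters, the two IOMMU values need to be merged in rather than pasted over them. The sketch below shows one way to build the merged value; the function name `merge_iommu_params` is hypothetical, and it assumes parameters are separated by spaces and contain no quoted spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a kernel command line with iommu.strict and
# iommu.passthrough set to the given values, keeping all other parameters.
# Assumes space-separated parameters without embedded spaces.
merge_iommu_params() {
  local existing="$1" strict="$2" passthrough="$3" out="" param
  for param in $existing; do
    case "$param" in
      iommu.strict=*|iommu.passthrough=*) ;;   # drop any old values
      *) out="${out:+$out }$param" ;;
    esac
  done
  echo "${out:+$out }iommu.strict=${strict} iommu.passthrough=${passthrough}"
}

# Example with a made-up existing command line:
merge_iommu_params "console=ttyS0 iommu.passthrough=1" 1 0
# prints: console=ttyS0 iommu.strict=1 iommu.passthrough=0
```

The printed string is what would go between the quotes of GRUB_CMDLINE_LINUX; the sketch does not edit /etc/default/grub itself.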

Update GRUB and reboot to apply the settings:

    

        
        
    sudo update-grub && sudo reboot

Verify that the default settings have been successfully applied:

    

        
        
    sudo dmesg | grep iommu

The output should show a Translated default domain type (iommu.passthrough=0) and a strict TLB invalidation policy (iommu.strict=1):

    

        
    [    0.877401] iommu: Default domain type: Translated (set via kernel command line)
    [    0.877404] iommu: DMA domain TLB invalidation policy: strict mode (set via kernel command line)
    ...
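
The same check can be scripted. This is a sketch only: the function name `iommu_defaults_applied` is hypothetical, and the matched strings are copied from the example output above, so they may differ on other kernel versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log text on stdin and succeed only if it
# reports both a Translated default domain and strict TLB invalidation.
iommu_defaults_applied() {
  local log
  log="$(cat)"
  printf '%s\n' "$log" | grep -q 'iommu: Default domain type: Translated' &&
    printf '%s\n' "$log" | grep -q 'iommu: DMA domain TLB invalidation policy: strict mode'
}

# Usage on the server:
#   sudo dmesg | iommu_defaults_applied && echo "IOMMU defaults active"
```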

Establish a baseline on Arm Neoverse bare-metal instances

Note

To mirror a typical Tomcat deployment and simplify tuning, keep 8 CPU cores online and set the remaining cores offline. Adjust the CPU range to match your instance. The example below assumes 192 CPUs (as on AWS c8g.metal-48xl).

Set CPUs 8–191 offline:

    

        
        
    for no in {8..191}; do sudo bash -c "echo 0 > /sys/devices/system/cpu/cpu${no}/online"; done
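
For instances with a different CPU count, the range can be computed instead of hard-coded. The following dry-run sketch prints the commands rather than running them; the function name `offline_cmds` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the commands that would take CPUs
# keep..total-1 offline, leaving CPUs 0..keep-1 online.
offline_cmds() {
  local total="$1" keep="$2" cpu
  for ((cpu = keep; cpu < total; cpu++)); do
    echo "echo 0 > /sys/devices/system/cpu/cpu${cpu}/online"
  done
}

# Preview the first and last commands for 192 CPUs with 8 kept online.
offline_cmds 192 8 | sed -n '1p;$p'
```

After reviewing the output, it could be applied with something like `offline_cmds "$(nproc --all)" 8 | sudo bash`, assuming `nproc --all` reports the full CPU count on your instance.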

Run lscpu to confirm that CPUs 0–7 are online and the rest are offline.

Example output:

    

        
    Architecture:                aarch64
      CPU op-mode(s):            64-bit
      Byte Order:                Little Endian
    CPU(s):                      192
      On-line CPU(s) list:       0-7
      Off-line CPU(s) list:      8-191
    Vendor ID:                   ARM
      Model name:                Neoverse-V2
    ...
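
To check the online range in a script rather than by eye, output in the format shown above can be parsed. This is a sketch with a hypothetical function name, `online_cpu_list`, and it assumes the "On-line CPU(s) list" label appears exactly as in the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lscpu output on stdin and print the on-line CPU list.
online_cpu_list() {
  awk -F: '/On-line CPU\(s\) list/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Usage on the server:
#   [ "$(lscpu | online_cpu_list)" = "0-7" ] && echo "CPUs 0-7 online"
```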

Restart Tomcat on the Arm instance:

    

        
        
    ~/apache-tomcat-11.0.10/bin/shutdown.sh 2>/dev/null
    ulimit -n 65535 && ~/apache-tomcat-11.0.10/bin/startup.sh

From your x86_64 benchmarking client, run wrk2, whose binary is named wrk. Replace <tomcat_ip> with the server's IP address:

    

        
        
    ulimit -n 65535 && wrk -c1280 -t128 -R500000 -d60 http://<tomcat_ip>:8080/examples/servlets/servlet/HelloWorldExample

Example result:

    

        
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency    16.76s     6.59s   27.56s    56.98%
        Req/Sec     1.97k   165.05     2.33k    89.90%
      14680146 requests in 1.00m, 7.62GB read
      Socket errors: connect 1264, read 0, write 0, timeout 1748
    Requests/sec: 244449.62
    Transfer/sec:    129.90MB
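
In this run wrk2 was asked for 500,000 requests per second (-R500000) but achieved about 244,000. A sketch like the following can pull the achieved rate out of saved wrk2 output and express it as a share of the target; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read wrk2 output on stdin and print the Requests/sec value.
requests_per_sec() {
  awk '/^[[:space:]]*Requests\/sec:/ { print $2 }'
}

# Hypothetical helper: achieved throughput as a percentage of the target rate (-R).
achieved_pct() {
  awk -v got="$1" -v target="$2" 'BEGIN { printf "%.1f\n", 100 * got / target }'
}

achieved_pct 244449.62 500000   # prints 48.9
```

For example, `wrk ... | tee run.txt` followed by `achieved_pct "$(requests_per_sec < run.txt)" 500000` would report the share for a saved run, where run.txt is a placeholder file name.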

Disable access logging

Tomcat writes an access-log entry for every request, so disabling access logging removes that I/O overhead during benchmarking.

Edit server.xml and comment out (or remove) the org.apache.catalina.valves.AccessLogValve block:

    

        
        
    vi ~/apache-tomcat-11.0.10/conf/server.xml

    

        
        
    <!--
        <Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs"
                prefix="localhost_access_log" suffix=".txt"
                pattern="%h %l %u %t &quot;%r&quot; %s %b" />
    -->
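
Because the valve is disabled by commenting it out, a plain grep for AccessLogValve still matches. The sketch below strips XML comments first and then checks for an active valve. Both function names are hypothetical, and the comment stripping is a simple text filter, not a full XML parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy stdin to stdout with <!-- ... --> comments removed,
# including comments that span several lines.
strip_xml_comments() {
  awk '{
    line = $0; out = ""
    while (length(line) > 0) {
      if (in_comment) {
        i = index(line, "-->")
        if (i == 0) { line = "" }
        else { line = substr(line, i + 3); in_comment = 0 }
      } else {
        i = index(line, "<!--")
        if (i == 0) { out = out line; line = "" }
        else { out = out substr(line, 1, i - 1); line = substr(line, i + 4); in_comment = 1 }
      }
    }
    print out
  }'
}

# Hypothetical helper: succeed if the file still has an uncommented AccessLogValve.
access_log_enabled() {
  strip_xml_comments < "$1" | grep -q 'AccessLogValve'
}

# Usage on the server:
#   access_log_enabled ~/apache-tomcat-11.0.10/conf/server.xml && echo "still enabled"
```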

Restart Tomcat:

    

        
        
    ~/apache-tomcat-11.0.10/bin/shutdown.sh 2>/dev/null
    ulimit -n 65535 && ~/apache-tomcat-11.0.10/bin/startup.sh

Re-run wrk2:

    

        
        
    ulimit -n 65535 && wrk -c1280 -t128 -R500000 -d60 http://<tomcat_ip>:8080/examples/servlets/servlet/HelloWorldExample

Example result:

    

        
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency    16.16s     6.45s   28.26s    57.85%
        Req/Sec     2.16k     5.91     2.17k    77.50%
      16291136 requests in 1.00m, 8.45GB read
      Socket errors: connect 0, read 0, write 0, timeout 75
    Requests/sec: 271675.12
    Transfer/sec:    144.36MB

Set optimal thread counts

To minimize contention and context switching, align Tomcat’s CPU‑intensive thread count with available CPU cores.

While wrk2 is running, identify CPU‑intensive Tomcat threads:

    

        
        
    top -H -p "$(pgrep -n java)"

Example output:

    

        
    top - 08:57:29 up 20 min,  1 user,  load average: 4.17, 2.35, 1.22
    Threads: 231 total,   8 running, 223 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 31.7 us, 20.2 sy,  0.0 ni, 31.0 id,  0.0 wa,  0.0 hi, 17.2 si,  0.0 st
    MiB Mem : 386127.8 total, 380676.0 free,   4040.7 used,   2801.1 buff/cache
    MiB Swap:      0.0 total,      0.0 free,      0.0 used. 382087.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4677 ubuntu    20   0   36.0g   1.4g  24452 R  89.0   0.4   1:18.71 http-nio-8080-P
   4685 ubuntu    20   0   36.0g   1.4g  24452 R   4.7   0.4   0:04.42 http-nio-8080-A
   4893 ubuntu    20   0   36.0g   1.4g  24452 S   3.3   0.4   0:00.60 http-nio-8080-e
   4963 ubuntu    20   0   36.0g   1.4g  24452 S   3.3   0.4   0:00.66 http-nio-8080-e
   4924 ubuntu    20   0   36.0g   1.4g  24452 S   3.0   0.4   0:00.59 http-nio-8080-e
   4955 ubuntu    20   0   36.0g   1.4g  24452 S   3.0   0.4   0:00.60 http-nio-8080-e
   5061 ubuntu    20   0   36.0g   1.4g  24452 S   3.0   0.4   0:00.61 http-nio-8080-e
   4895 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.58 http-nio-8080-e
   4907 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.59 http-nio-8080-e
   4940 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.58 http-nio-8080-e
   4946 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.59 http-nio-8080-e
   4956 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.65 http-nio-8080-e
   4959 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.59 http-nio-8080-e
   4960 ubuntu    20   0   36.0g   1.4g  24452 R   2.7   0.4   0:00.60 http-nio-8080-e
   4962 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.57 http-nio-8080-e
   4982 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.63 http-nio-8080-e
   4983 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.58 http-nio-8080-e
   4996 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.60 http-nio-8080-e
   5033 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.59 http-nio-8080-e
   5036 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.66 http-nio-8080-e
   5056 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.61 http-nio-8080-e
   5065 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.56 http-nio-8080-e
   5068 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.61 http-nio-8080-e
   5070 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.60 http-nio-8080-e
   5071 ubuntu    20   0   36.0g   1.4g  24452 S   2.7   0.4   0:00.61 http-nio-8080-e
...
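
With this many threads, summing %CPU per thread name makes the picture easier to read. The following sketch does that for output in the column layout shown above, where %CPU is the ninth column and COMMAND the twelfth. The function name `cpu_by_thread_name` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read per-thread top output on stdin and print the
# total %CPU for each thread name. Only rows that start with a numeric PID
# and have at least 12 columns are counted.
cpu_by_thread_name() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 12 { cpu[$12] += $9 }
       END { for (name in cpu) printf "%s %.1f\n", name, cpu[name] }' | sort
}

# Usage on the server, assuming batch mode keeps the same column layout:
#   top -H -b -n 1 -p "$(pgrep -n java)" | cpu_by_thread_name
```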

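Thread names are truncated to 15 characters in this output: http-nio-8080-e is an exec (worker) thread, http-nio-8080-P is the Poller thread, and http-nio-8080-A is the Acceptor thread. You'll typically see the http-nio-8080-e and http-nio-8080-P threads as CPU-intensive. Because the http-nio-8080-P thread count is fixed at 1 (in current Tomcat releases) and you have 8 online CPU cores, set the http-nio-8080-e thread count to 7.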
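
The sizing rule above is simple arithmetic: online CPUs minus the one http-nio-8080-P thread. A minimal sketch, with a hypothetical function name `worker_thread_count`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: worker threads = online CPUs minus the CPUs reserved
# for other busy threads (one http-nio-8080-P thread in this setup).
worker_thread_count() {
  local online_cpus="$1" reserved="${2:-1}"
  local n=$((online_cpus - reserved))
  if [ "$n" -lt 1 ]; then
    n=1   # never go below one worker
  fi
  echo "$n"
}

worker_thread_count 8   # prints 7
```

On the server, `worker_thread_count "$(nproc)"` would apply the same rule to however many CPUs are currently online.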

Edit server.xml and update the HTTP connector to set the worker thread counts and connection limits:

    

        
        
    vi ~/apache-tomcat-11.0.10/conf/server.xml

Replace the existing connector:

    

        
        
    <!-- Before -->
        <Connector port="8080" protocol="HTTP/1.1"
                   connectionTimeout="20000"
                   redirectPort="8443" />

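With the tuned settings, which fix the worker pool at 7 threads (minSpareThreads and maxThreads) and raise the keep-alive and connection limits: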

    

        
        
    <!-- After -->
        <Connector port="8080" protocol="HTTP/1.1"
                   connectionTimeout="20000"
                   redirectPort="8443"
                   minSpareThreads="7"
                   maxThreads="7"
                   maxKeepAliveRequests="500000"
                   maxConnections="100000"
        />

Restart Tomcat on the server, then re-run wrk2 from the client:

    

        
        
    ~/apache-tomcat-11.0.10/bin/shutdown.sh 2>/dev/null
    ulimit -n 65535 && ~/apache-tomcat-11.0.10/bin/startup.sh

    ulimit -n 65535 && wrk -c1280 -t128 -R500000 -d60 http://<tomcat_ip>:8080/examples/servlets/servlet/HelloWorldExample

Example result:

    

        
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency    10.26s     4.55s   19.81s    62.51%
        Req/Sec     2.86k    89.49     3.51k    77.06%
      21458421 requests in 1.00m, 11.13GB read
    Requests/sec: 357835.75
    Transfer/sec:    190.08MB
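
To put the three example results side by side, the throughput change at each step can be computed from the Requests/sec figures. This is plain arithmetic on the numbers reported above; the function name `pct_change` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage change from a "before" value to an "after" value.
pct_change() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", 100 * (after - before) / before }'
}

pct_change 244449.62 271675.12   # access logging disabled: prints 11.1
pct_change 271675.12 357835.75   # thread counts tuned:     prints 31.7
pct_change 244449.62 357835.75   # both steps combined:     prints 46.4
```

These percentages describe the example runs only; your own results will vary with instance type, client, and network.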

With a solid baseline in place, you’re ready to proceed to NIC queue tuning, NUMA locality optimization, and IOMMU exploration in the next sections.
