Recently I've been tuning our market data feed processors. This is a single-threaded, "straight-line speed" application that reads data from the network, parses it, converts it to our internal data format, and writes it out again.
I've already reduced the time taken for a benchmark workload (parsing a large data feed file) by 68% using only code changes (I'll blog about this soon), but I wanted to see whether OS-level optimisations could reduce it further.
Here is an experiment in pinning the main application thread to a CPU core, to see whether that prevents the loss of performance that occurs when the OS scheduler moves the thread between CPU cores or sockets (losing cache contents in the process).
Results were measured using:
perf stat -o perf.log java MyWorkloadClass
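To smooth out run-to-run variation, perf stat can also repeat the workload and report averaged counters; the figures below are from single runs:
perf stat -r 5 -o perf.log java MyWorkloadClass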
On Linux you can use the taskset command to pin a process to a CPU core.
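If pinning the entire JVM is acceptable, you can launch it under taskset directly; a minimal sketch, assuming core 1 is the target:
taskset -c 1 java MyWorkloadClass
Bear in mind this confines every JVM thread (GC, JIT compiler, etc.) to that core, which is why pinning just the main thread takes a few more steps.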
First you need to discover the native ID for the Java thread you want to pin.
The steps for doing that are:
1) Get the Java process ID (this assumes you have 1 Java process)
pgrep java
2) Use the jstack tool to create a thread dump for the Java process ID.
jstack -l <pid>
The result will be something like:
"main" #1 prio=5 os_prio=0 tid=0x0000000002128800 nid=0x255b runnable [0x00007f3a00398000]
java.lang.Thread.State: RUNNABLE
...
3) Use grep on the result to pick out the thread you want to pin by name (e.g. main), keeping the line that contains nid (the native thread ID attribute).
grep main <stack dump> | grep nid
4) Use awk to extract the nid value (converting it from hex to decimal)
awk -F'nid=| runnable' '{printf "%d",$2}'
(Whether printf "%d" converts the 0x... hex string depends on your awk implementation; if it prints 0, use gawk's strtonum or shell arithmetic, as in the script below.)
This can be condensed into one Linux command:
jstack -l `pgrep java` | grep main | grep nid | awk -F'nid=| runnable' '{printf "%d",$2}'
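For repeated experiments it may be easier to wrap these steps in a small script. A minimal sketch, assuming a single Java process and a uniquely named thread (the script name and its arguments are hypothetical):

#!/bin/bash
# pin-thread.sh <thread-name> <core>, e.g. ./pin-thread.sh main 0
THREAD_NAME=${1:-main}
CORE=${2:-0}
PID=$(pgrep java)    # assumes exactly one Java process
# Pull the hex nid out of the thread dump line for the named thread
NID_HEX=$(jstack -l "$PID" | grep "\"$THREAD_NAME\"" | sed -n 's/.*nid=\(0x[0-9a-f]*\).*/\1/p')
TID=$((NID_HEX))     # shell arithmetic converts 0x... hex to decimal
taskset -cp "$CORE" "$TID"    # -c takes a CPU list rather than a bitmask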
The final step is to take this decimal native thread ID and pin it to a CPU core. The argument is a bitmask of allowed CPUs, so 0x00000001 pins the thread to the first core (CPU 0):
taskset -p 0x00000001 <java thread pid>
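Running taskset -p with just the thread ID (no mask) reports the current affinity, which is a quick way to confirm the pin took effect; output will look something like:
taskset -p <java thread pid>
pid 9563's current affinity mask: 1
(9563 is the decimal form of the nid=0x255b thread above. taskset -cp 0 <java thread pid> is the equivalent command using a CPU list instead of a mask.)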
Here are the results (perf output) for the sample workload with and without CPU pinning:
Pinning the Java thread reduced the workload time from 293 seconds to 225 seconds (a 23% reduction). The counters tell a consistent story: fewer context switches and CPU migrations, roughly half the stalled cycles, and a large drop in branch misses, all pointing to better cache and branch-predictor locality. The test was performed on a single-socket workstation, not a server:
cat /proc/cpuinfo | grep "model name"
Intel(R) Core(TM) i5 CPU         660  @ 3.33GHz
Intel(R) Core(TM) i5 CPU         660  @ 3.33GHz
Intel(R) Core(TM) i5 CPU         660  @ 3.33GHz
Intel(R) Core(TM) i5 CPU         660  @ 3.33GHz
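The i5 660 is a dual-core part with Hyper-Threading, so the four entries above are logical CPUs, with two sharing each physical core's caches. When choosing a core to pin to, lscpu -e (if available) shows which logical CPUs map to which core:
lscpu -e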
I will test on a server class machine next week.
Perf stats without CPU pinning
    297481.389002 task-clock                #    1.014 CPUs utilized
           35,674 context-switches          #    0.000 M/sec                  
            1,405 CPU-migrations            #    0.000 M/sec                  
          110,733 page-faults               #    0.000 M/sec                  
1,063,417,614,122 cycles                    #    3.575 GHz                     [83.34%]
  161,224,883,847 stalled-cycles-frontend   #   15.16% frontend cycles idle    [83.34%]
  144,636,718,849 stalled-cycles-backend    #   13.60% backend  cycles idle    [66.66%]
2,376,914,490,864 instructions              #    2.24  insns per cycle        
                                            #    0.07  stalled cycles per insn [83.33%]
  657,693,070,884 branches                  # 2210.871 M/sec                   [83.33%]
    8,735,293,464 branch-misses             #    1.33% of all branches         [83.34%]

    293.502171112 seconds time elapsed
Perf stats with CPU pinning
    229904.273198 task-clock                #    1.019 CPUs utilized          
           27,529 context-switches          #    0.000 M/sec                  
            1,219 CPU-migrations            #    0.000 M/sec                  
          144,662 page-faults               #    0.001 M/sec                  
  821,212,115,937 cycles                    #    3.572 GHz                     [83.33%]
   64,393,588,977 stalled-cycles-frontend   #    7.84% frontend cycles idle    [83.33%]
   48,551,948,602 stalled-cycles-backend    #    5.91% backend  cycles idle    [66.66%]
2,209,615,996,688 instructions              #    2.69  insns per cycle        
                                            #    0.03  stalled cycles per insn [83.34%]
  623,314,562,813 branches                  # 2711.192 M/sec                   [83.33%]
      366,547,412 branch-misses             #    0.06% of all branches         [83.35%]

    225.562024911 seconds time elapsed