Can splitting Java core class methods increase performance?

Published February 20th, 2014

An experiment to see if splitting hot methods that are too big to inline can increase performance

Warning

This experiment is at the bleeding edge of my understanding of HotSpot's JIT compiler and also the jmh benchmarking tool. It could contain code errors, statistical errors, and basic misunderstandings so if you spot any mistakes or bad assumptions please let me know :)

Background

About a year ago I was reading The Well Grounded Java Developer and got interested in learning about bytecode and the JIT compiler so I started writing a tool to make it easier to understand the wieldy XML of hotspot.log. That tool is now an open source project called JITWatch and has been accepted as an AdoptOpenJDK project.

The latest feature I am working on is a suggestion tool whose purpose is to identify the low-hanging fruit when it comes to optimising your Java code so that the JIT compiler can do the best possible job.

One of the HotSpot C2 (Server) JIT compiler's optimisation tricks is to inline methods to save the invocation cost and it will always try to inline any method consisting of 35 bytecodes or less (use VM switch -XX:MaxInlineSize=n to alter this).

Another optimisation is to look at the invocation count of a method to determine its "hotness" (TODO look up the threshold for hotness) and if a hot method is 320 bytescodes or less (VM-dependent, use VM switch -XX:FreqInlineSize=n to alter) then it will also be inlined even though it is more than 35 bytecodes long.

The log error message when HotSpot cannot inline a hot method is:

<inline_fail reason='hot method too big'/>

which appears after the <call> tag for a <method> in the <phase name="parse"> section of a <task> tag.

This looked like a good place to start making suggestions - highlighting where hot methods failed to be inlined due to their size (specified in the "bytes" attribute of the <method> tag (bytes here means the number of bytecode instructions).

Using the biggest hotspot.log I could generate (from the behemoth that is Eclipse, naturally) I started searching for 'hot method too big' failures and was surprised to see them in many core Java classes. From here I decided to focus on performance critical core classes such as sorting code and this led me to look at java.util.ComparableTimSort

The next step was to write a simple sorting example that I could benchmark using OpenJDK's jmh tool (by Aleksey Shipilev) which I had wanted to try out for some time.

The design for the test method is to populate a 1,000,000 element list with random Integer values and then measuring the time taken to sort the list. I used a seeded java.util.Random to ensure the same elements were inserted into the list each time.

Here is my jmh test class:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.GenerateMicroBenchmark;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class PerfTestSort
{
@GenerateMicroBenchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void testSort() {
List<Integer> list = new ArrayList<Integer>();

int count = 1000000;

Random seededRandom = new Random(12345678);

for (int i = 0; i < count; i++) {
list.add(seededRandom.nextInt());
}

Collections.sort(list);
}

public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(".*" + PerfTestSort.class.getSimpleName() + ".*")
.forks(1)
.build();

new Runner(opt).run();
}
}

Running the testSort() method once outside of jmh with hotspot.log output enabled I was able to see that in the compile for the method java.util.ComparableTimSort.sort(Object[], int, int) the call to method java.util.ComparableTimSort.mergeAt(int) failed to be inlined for reason 'hot method too big'.

The relevant part of the hotspot.log file is:

I could see that at 356 bytecodes it was only slightly too long for the 320 bytecode limit for a hot method.

I had the idea that perhaps by splitting mergeAt() into 2 smaller methods then this hot code could be inlined.

The next steps were to

1) Unzip the JDK's src.zip

2) Modify the method java.util.ComparableTimSort.mergeAt(int) to produce less than 320 bytecodes

3) Compile the modified ComparableTimSort class

4) Unjar the JRE's lib/rt.jar

5) Replace java/util/ComparableTimSort class with my new version

6) Create a new rt.jar containing the modified class

7) Benchmark my sorting test method using jmh and compare the original rt.jar to the one containing the split mergeAt method

8) Also compare whether the same result could be obtained just by specifying higher values for -XX:MaxInlineSize and -XX:FreqInlineSize

Here is my modified java.util.ComparableTimSort.mergeAt(int)

/**
* Merges the two runs at stack indices i and i+1. Run i must be
* the penultimate or antepenultimate run on the stack. In other words,
* i must be equal to stackSize-2 or stackSize-3.
*
* @param i stack index of the first of the two runs to merge
*/
@SuppressWarnings("unchecked")
private void mergeAt(int i) {
assert stackSize >= 2;
assert i >= 0;
assert i == stackSize - 2 || i == stackSize - 3;

int base1 = runBase[i];
int len1 = runLen[i];
int base2 = runBase[i + 1];
int len2 = runLen[i + 1];
assert len1 > 0 && len2 > 0;
assert base1 + len1 == base2;

/*
* Record the length of the combined runs; if i is the 3rd-last
* run now, also slide over the last run (which isn't involved
* in this merge). The current run (i+1) goes away in any case.
*/
runLen[i] = len1 + len2;
if (i == stackSize - 3) {
runBase[i + 1] = runBase[i + 2];
runLen[i + 1] = runLen[i + 2];
}
stackSize--;

/*
* Find where the first element of run2 goes in run1. Prior elements
* in run1 can be ignored (because they're already in place).
*/
int k = gallopRight((Comparable<Object>) a[base2], a, base1, len1, 0);
assert k >= 0;
base1 += k;
len1 -= k;
if (len1 == 0)
return;

// move remainder to a separate method
mergeAt2(base1, base2, len1, len2, a);
}

@SuppressWarnings("unchecked")
private void mergeAt2(int base1, int base2, int len1, int len2, Object[] a) {
/*
* Find where the last element of run1 goes in run2. Subsequent elements
* in run2 can be ignored (because they're already in place).
*/
len2 = gallopLeft((Comparable<Object>) a[base1 + len1 - 1], a,
base2, len2, len2 - 1);
assert len2 >= 0;
if (len2 == 0)
return;

// Merge remaining runs, using tmp array with min(len1, len2) elements
if (len1 <= len2)
mergeLo(base1, len1, base2, len2);
else
mergeHi(base1, len1, base2, len2);
}

Here are the jmh results using 100 warmup loops and 100 iterations per test result.

1) Original rt.jar and standard VM options

Result : 367027.592 ±(99.9%) 1027.549 us/op
Statistics: (min, avg, max) = (363215.462, 367027.592, 374771.395), stdev = 3029.753
Confidence interval (99.9%): [366000.042, 368055.141]

Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 367027.592 1027.549 us/op

2) Original rt.jar and -XX:MaxInlineSize=1024 -XX:FreqInlineSize=1024

Result : 362610.675 ±(99.9%) 994.842 us/op
Statistics: (min, avg, max) = (359110.426, 362610.675, 372784.590), stdev = 2933.314
Confidence interval (99.9%): [361615.834, 363605.517]

Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 362610.675 994.842 us/op

3) Modified rt.jar and standard VM options

Result : 358166.104 ±(99.9%) 882.070 us/op
Statistics: (min, avg, max) = (354702.777, 358166.104, 367035.542), stdev = 2600.803
Confidence interval (99.9%): [357284.034, 359048.174]

Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 358166.104 882.070 us/op

4) Modified rt.jar and -XX:MaxInlineSize=1024 -XX:FreqInlineSize=1024

Result : 364510.037 ±(99.9%) 1086.795 us/op
Statistics: (min, avg, max) = (361420.264, 364510.037, 373596.578), stdev = 3204.439
Confidence interval (99.9%): [363423.243, 365596.832]

Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 364510.037 1086.795 us/op

Interpretation of results

The best performance was obtained using the split mergeAt() method with standard VM options (358166 us/op) 2.41% faster than unmodified code.

With the modified code, altering the inlining parameters so that all methods of ComparableTimSort could be inlined led to worse performance (364510 us/op) only 0.68% faster than unmodified code.

With the unmodified rt.jar, increasing the inlining parameters led to better performance (362610 us/op) 1.20% faster than unmodified code on default inlining parameters.

Further work

If these results are correct, show the JITWatch proof that the split methods were inlined (they are) and see if performance can be improved further by splitting other large hot methods in ComparableTimSort (gallopLeft and gallopRight etc.).

Check whether these kind of optimisations are no longer relevant as JDK8 core libraries will be rewritten to use streams and lambdas.

Similar Articles

Cascading OpenJDK builds
Quickly find worst GC pauses in G1 and Parallel GC logs
Identify the Java Thread taking the most CPU
Run an IBM J9 VM in a docker container
DemoFX Part 3
More Bytecode Geekery with JarScan
DemoFX Part 2
OpenJFX Nightly Builds for Linux amd64 and armv6hf for Raspberry Pi
DemoFX Part 1
Add JavaFX support to Azul Systems' Zulu JDK using OpenJFX
JarScan comparison between 8u31 and 8u40 rt.jar
Development Goals for 2015
CPU pinning Java threads with jstack and taskset
The power of JIT inlining
Building OpenJDK 9 and OpenJFX from source on Debian
Building hsdis-amd64.dylib on Mac OSX
Java 7 and Java 8 core methods above the default hot method inlining threshold
Java acronyms
Location of Java applet log on Windows XP
Examination of PerformanceTools using JITWatch
Building hsdis on Linux amd64 on Debian
Update Java plugin used by OSX
Understanding HotSpot logs
Raspberry Pi TFT hack and video glasses, wearable Pi project
Maven setup
Compiling JDK8 lambdas on Eclipse Luna using Ant
Reading and writing bytes between Java and Obj-C
Reference manual index, Java, PHP, JavaScript
Java JIT analysis and code coverage
Builder Pattern
Refactoring old style Java synchronized code with CopyOnWriteArrayList and Collections.unmodifiableList
JavaScript prototype idiom for OO design
Cross platform SWT dispatch loop idiom
[SOLVED] Java 7 update 21 mixed code warning dialog with signed applet
Clear Linux buffers, cache when benchmarking filesystem
JavaFX MediaPlayer crash (Debian amd64) playing mp3 and displaying graphics
Java on Mac OSX 10.8 Safari broken by XProtect.meta.plist whitelist update
Updated Tutorial for JDK8 early access on Raspberry Pi
Using Java jarsigner to check a jar signature
Eclipse can't find import javafx
JavaFX Ensemble sample won't run from Eclipse due to bad server config
Java applet deployment using deployJava.js
deployJava.js fails to install Java when Java not present
jusched.exe still running after Java uninstalled
High performance modulo operation
Applet graphics corruption in Java 7 update 10, 11, 13 on Mac OSX
ncurses type applications in Java on Raspberry Pi (Lanterna Console)
OpenJDK IcedTea plugin java debugging
Raspberry Pi Java Applets - Iceweasel OpenJDK IcedTea plugin (HardFloat)
[SOLVED] Debian ./java: No such file or directory
Objective C for Java Programmers
Java cheat sheet for angle plotting in degrees, radians, and Pi radians
[SOLVED] Eclipse Java autocomplete not working
Java trace memory leaks with hprof and verbose:gc
Java Unsupported major minor version 51.0
Java AWT Graphics2D anti-aliasing in a Java 1.1 compatible way using reflection
Getting started with Java on Raspberry Pi
Using the Java 1.5 ScheduledExecutorService for scheduling repeating tasks
Java wrapper around an external process
Map JDBC types to Java primitive and Object types
java.io.IOException: Too many open files
Low latency Java tips
Select correct SWT jar for your OS and JVM at runtime
Java can an int fit inside a float or a double without loss of precision?
Java concatenate null String with +=
Use 64 bit Sun Java plugin in Firefox on Ubuntu
openSSL convert PEM certificate and import to Java keystore
Java short form array iteration using for-loop
Java for loop syntax to replace Iterator
Java String.split include empty trailing strings
Java variable scope in switch statement
Java defensive programming - compare variable with constant
SWT Canvas plot centred text
FATAL: sorry, too many clients already (PostgreSQL)
Java numerical overflow
Business Insider's UBS Quant puzzle solution
Java return code in Linux shell script
Java primitives in size order
SWT literal ampersand instead of keyboard hotkey
Elegant use of Java's Math.max() to prevent negative numbers
Java remove non alphanumeric characters from String
SWT best practice - single Display multiple Shells
Ubuntu remove OpenJDK and use Sun Java as default
Java can't delete directory
Deadlock when 2 threads write to the same HashMap