Got me thinking

In a not uncommon turn of events, I saw a tweet by Aleksey that sent me off into a JVM daydream, wondering what the (bytecode) size distribution of those JDK methods looked like.
Tweet by @shipilev


In my previous blog post I described some enhancements I'd made to the JarScan tool (a program for performing statistical analysis on jar files).
This gave me a framework for methodically plodding through jars and crunching stats on the bytecode contained within, so it was nice and easy to add a new study to create a frequency count of method bytecode sizes.
The JarScan tool is available on GitHub as part of the JITWatch project (a tool for analysing the decisions made by the HotSpot Just In Time compiler) and can be downloaded here:
The usage syntax for JarScan is:
JarScan --mode=<mode> [--packages=a,b,c,...] [params] <jars>

A new mode

The new mode is called methodSizeHisto and takes no parameters.
To create the size/frequency (histogram) data do:
./ --mode=methodSizeHisto rt.jar > histoJDK.csv
This produces the following output:
Method size frequencies
(All packages)
Bytecode size   Count   Percent
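To make the three columns concrete, here is a minimal sketch of how a size/frequency table like this can be built from a list of method sizes. It is not JarScan's actual implementation; the class and method names (MethodSizeHistogram, countBySize, percent) are made up for illustration and the sample sizes are invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MethodSizeHistogram {

    // Count how many methods have each bytecode size, ordered by size.
    public static Map<Integer, Integer> countBySize(List<Integer> sizes) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int size : sizes) {
            counts.merge(size, 1, Integer::sum);
        }
        return counts;
    }

    // Share of all methods that fall into one size bucket.
    public static double percent(int count, int total) {
        return total == 0 ? 0.0 : 100.0 * count / total;
    }

    public static void main(String[] args) {
        // Invented sizes purely for illustration, not real JDK data.
        List<Integer> sizes = Arrays.asList(5, 5, 5, 1, 12, 5, 1, 40);
        Map<Integer, Integer> counts = countBySize(sizes);

        System.out.println("Bytecode size,Count,Percent");
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            System.out.printf("%d,%d,%.2f%n", entry.getKey(), entry.getValue(),
                    percent(entry.getValue(), sizes.size()));
        }
    }
}
```

A sorted map keeps the rows in ascending size order, which is what a histogram plotter wants on its x axis.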

The results

Wow, there are a lot of methods that are 5 bytes long but the rest looks like a long tail. I'll just throw this into $OFFICE_PRODUCT and plot a chart ...
Much wailing and gnashing of teeth later I thought, hey, JavaFX has got some neat charty stuff so I quickly threw together a histogram plotter:
./ histoJDK.csv

Full JDK bytecode size histograms

OK: a long tail, a large spike at 5 bytes, and thankfully very few really massive methods.
JDK full bytecode size histogram
Let's zoom in on methods up to 325 bytes (x86-64 limit for hot method inlining)
JDK bytecode size histogram up to 325 bytes
Zoom in a little more (nothing significant about 100 bytes to the JVM)
JDK bytecode size histogram up to 100 bytes
And finally check up to 35 bytes (methods this size and below are usually inlined)
JDK bytecode size histogram up to 35 bytes
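The two thresholds used for the zoom levels above correspond to the HotSpot flags -XX:MaxInlineSize (35) and -XX:FreqInlineSize (325 on x86-64). As a rough sketch, a method's bytecode size can be bucketed against them like this. The class and enum names are made up for illustration, and real inlining decisions depend on more than size (call site hotness, inlining depth and so on), so treat this as a size-only approximation.

```java
public class InlineSizeCheck {

    // Default values mentioned in the text (x86-64 HotSpot).
    static final int MAX_INLINE_SIZE = 35;   // -XX:MaxInlineSize
    static final int FREQ_INLINE_SIZE = 325; // -XX:FreqInlineSize

    public enum Bucket { USUALLY_INLINED, INLINED_IF_HOT, TOO_LARGE }

    // Classify a method purely by its bytecode size in bytes.
    public static Bucket classify(int bytecodeSize) {
        if (bytecodeSize <= MAX_INLINE_SIZE) {
            return Bucket.USUALLY_INLINED;
        }
        if (bytecodeSize <= FREQ_INLINE_SIZE) {
            return Bucket.INLINED_IF_HOT;
        }
        return Bucket.TOO_LARGE;
    }

    public static void main(String[] args) {
        int[] sizes = { 5, 35, 36, 325, 326 };
        for (int size : sizes) {
            System.out.println(size + " bytes: " + classify(size));
        }
    }
}
```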

Double checking

Now I'm not entirely happy using rt.jar for bytecode stats because it contains a large number of near identical classes dedicated to i18n as well as whole swathes of packages for Swing, CORBA, and XML parsing which make it a fairly unusual beastie.
It's still a convenient and large class file corpus, so a quick check to see if these results still hold is to re-run the histo generation for just the java.* package:
./ --mode=methodSizeHisto --packages=java.* rt.jar > histoJava.csv
Method size frequencies
Bytecode size   Count   Percent

java.* packages bytecode size histograms

./ histoJava.csv
Still a big spike at 5 bytes but at least we don't have those massive (20KB+) methods in the java.* package.
java.* packages full bytecode size histogram
Up to 325 bytes (x86-64 limit for hot method inlining)
java.* packages bytecode size histogram up to 325 bytes
Zoom in a bit
java.* packages bytecode size histogram up to 100 bytes
And finally the ones that should get inlined
java.* packages bytecode size histogram up to 35 bytes

So what's the deal with so many methods of 5 bytes in length?

In both the full JDK and in the java.* sub-packages more than 12% of all methods consist of 5 bytes of bytecode.

Another mode required

To filter all methods of a given length I created another mode for JarScan.
The new mode is called methodLength and requires a --length parameter.
To list all methods which are 5 bytes long:
./ --mode=methodLength --length=5 rt.jar > methodsLength5JDK.csv

TLDR: It's all about the accessors

For the full JDK it turns out that around 42% of methods with length 5 are named get*
wc -l methodsLength5JDK.csv
20047 methodsLength5JDK.csv

grep get methodsLength5JDK.csv | wc -l
And for just the java.* packages:
./ --mode=methodLength --length=5 --packages=java.* rt.jar > methodsLength5Java.csv
Around 37% of java.* methods with length 5 are named get*
wc -l methodsLength5Java.csv
3861 methodsLength5Java.csv

grep get methodsLength5Java.csv | wc -l
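One caveat with the counts above: grep get matches "get" anywhere on a line, so a method called target or forget (or a class whose name contains "get") is counted too. A stricter check looks only at the method name and requires the get prefix followed by an upper-case letter. The sketch below assumes you already have the bare method names; how to pull them out of the JarScan CSV depends on its column layout, which isn't shown here. GetterShare and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

public class GetterShare {

    // True for names like getName or getX, false for target, forget or get.
    public static boolean looksLikeGetter(String methodName) {
        return methodName.length() > 3
                && methodName.startsWith("get")
                && Character.isUpperCase(methodName.charAt(3));
    }

    // Number of method names that follow the get* naming pattern.
    public static long countGetters(List<String> methodNames) {
        return methodNames.stream().filter(GetterShare::looksLikeGetter).count();
    }

    // Percentage of method names that follow the get* naming pattern.
    public static double getterPercent(List<String> methodNames) {
        if (methodNames.isEmpty()) {
            return 0.0;
        }
        return 100.0 * countGetters(methodNames) / methodNames.size();
    }

    public static void main(String[] args) {
        // Invented names purely for illustration.
        List<String> names = Arrays.asList("getName", "intValue", "target", "getX", "forget");
        System.out.println(countGetters(names) + " of " + names.size() + " look like getters");
        System.out.println(getterPercent(names) + "%");
    }
}
```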

What does a getter look like?

Picking a getName() method as an example, you can see that the general pattern is:
 public java.lang.String getName();
   descriptor: ()Ljava/lang/String;
   flags: ACC_PUBLIC
     stack=1, locals=1, args_size=1
        0: aload_0
        1: getfield      #18                 // Field name:Ljava/lang/String;
        4: areturn
The aload_0 instruction pushes "this" onto the stack.
The getfield instruction fetches a field from an object.
The return instruction (in this case areturn, which returns an object reference) hands the value on top of the stack back to the caller.
Each instruction's opcode is 1 byte (hence bytecode), and the getfield instruction takes a 2-byte parameter, which is an index into the runtime constant pool of the current class.
3 instructions + 2 byte parameter = 5 bytes of bytecode.
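The same arithmetic can be checked against the raw bytes. The sketch below hard-codes the five bytes of the getter shown above (opcodes 0x2a for aload_0, 0xb4 for getfield and 0xb0 for areturn, with constant pool index #18 encoded as 0x00 0x12) and walks them instruction by instruction. It only understands these three opcodes, so it is an illustration and not a general bytecode parser; the class and method names are made up.

```java
public class GetterBytecode {

    static final int ALOAD_0 = 0x2a;
    static final int GETFIELD = 0xb4;
    static final int ARETURN = 0xb0;

    // The 5 bytes of the getName() body shown above.
    public static byte[] getNameCode() {
        return new byte[] {
            (byte) ALOAD_0,          // 0: aload_0
            (byte) GETFIELD, 0, 18,  // 1: getfield #18 (2-byte constant pool index)
            (byte) ARETURN           // 4: areturn
        };
    }

    // Number of operand bytes that follow each opcode this sketch knows about.
    static int operandBytes(int opcode) {
        switch (opcode) {
            case ALOAD_0:  return 0;
            case GETFIELD: return 2;
            case ARETURN:  return 0;
            default:
                throw new IllegalArgumentException("Unhandled opcode 0x" + Integer.toHexString(opcode));
        }
    }

    // Walk the code array and count instructions (not bytes).
    public static int countInstructions(byte[] code) {
        int pc = 0;
        int instructions = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xff; // bytes are signed in Java
            pc += 1 + operandBytes(opcode);
            instructions++;
        }
        return instructions;
    }

    public static void main(String[] args) {
        byte[] code = getNameCode();
        System.out.println(code.length + " bytes, " + countInstructions(code) + " instructions");
    }
}
```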

Other field accessors

Not all of the 5-byte accessors are named get*.
Here is the bytecode for intValue() on class java.lang.Byte:
 public int intValue();
   descriptor: ()I
   flags: ACC_PUBLIC
     stack=1, locals=1, args_size=1
        0: aload_0
        1: getfield      #22                 // Field value:B
        4: ireturn

Further Ideas

There is an interesting number of methods smaller than 5 bytes in the JDK (bytecode size, count):
1 2885
2 3230
3 1259
4 2382
There are also a few enormous methods (bytecode size, count):
23075 1
22832 1
17828 1
17441 1
15361 1
14060 1
12087 1
11759 1
10921 1
Feel free to play around with JarScan and let me know if you find anything cool!