Optimizing Java Summary
by James Gough; Benjamin J Evans; Chris Newland
Adam Feldscher, I699 Performance Software Design II
Chapter 9. Code Execution on the JVM
JVM
• JVM spec says how JVM implementations should execute code
• Direct execution is faster than interpreted execution
• Thus, JIT
• HotSpot
• Popular implementation of JVM
• Profile guided optimization
Bytecode Interpretation
• JVM uses the following to track code execution
• Evaluation Stack
• Object Heap
• Local variables
• Each operation (op-code) is one byte, hence Bytecode
• Roughly 200 of the 256 possible opcodes are in use as of Java 10
• Some are similar to machine opcodes: add, sub, load, store
• Others invoke methods on classes or interfaces, or dynamic methods (lambdas), etc.
• invokevirtual, invokespecial, invokeinterface, invokestatic, invokedynamic
• Simple operations such as arithmetic are done directly in assembly
• Things like invocation need to interact with the virtual machine
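As a rough illustration of the points above, the sketch below pairs a few Java statements with the opcodes that javac typically emits for them. The class and method names (BytecodeDemo, PoliteGreeter, demo) are made up for this example, and the opcode comments describe typical javac output rather than anything guaranteed by the notes.

```java
import java.util.function.IntBinaryOperator;

final class BytecodeDemo {
    interface Greeter {
        String greet(String name);
    }

    static class PoliteGreeter implements Greeter {
        @Override
        public String greet(String name) {
            return "Hello, " + name;
        }
    }

    // Typically compiles to four one-byte opcodes: iload_0, iload_1, iadd, ireturn.
    static int add(int a, int b) {
        return a + b;
    }

    static String demo() {
        int sum = add(2, 3);                          // invokestatic
        PoliteGreeter concrete = new PoliteGreeter(); // new, dup, invokespecial <init>
        String viaClass = concrete.greet("JVM");      // invokevirtual (receiver typed as a class)
        Greeter iface = concrete;
        String viaInterface = iface.greet("JVM");     // invokeinterface (receiver typed as an interface)
        IntBinaryOperator op = (x, y) -> x * y;       // invokedynamic creates the lambda object
        int product = op.applyAsInt(sum, 2);          // invokeinterface
        return viaClass + "|" + viaInterface + "|" + product;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

After compiling, running javap -c on the class file should show the actual instruction listing, which is a quick way to check the comments against a particular JDK.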
“Safe-points”
• JVM needs to stop all user code to perform housekeeping
• Ex: Garbage collection
• All threads are true OS threads, but they execute both JVM code and user code
• In the interpreter, safe points fall between opcodes
• Harder in JIT-compiled code, but the safe points are marked
• Stop all code at safe points, then do the housekeeping
AOT and JIT
• Ahead of time compilation – C/C++ …
• Need to make conservative choices about available machine instructions for portability
• Or compile for a specific system
• Best for extreme performance cases, but does not scale to many architectures
• Just in time compilation – Java …
• Can profile code
• Optimize based on profile data
• Compile for exactly what instructions are available on that machine
• Steals resources from the running program
JIT
• Why not save and export profile data?
• Very dependent on runtime conditions
• Example: a high-frequency trading system on the day the jobs report is released
• Open question: why not save a separate profile for that day?
• JVM intentionally doesn’t allow this
• Must rebuild profile data from scratch each time
• HotSpot now allows AOT compilation of Java
• Not recommended
HotSpot
• HotSpot
• multithreaded C++ application
• HotSpot JIT
• Basic unit of compilation is a method; the whole method is compiled
• On Stack Replacement (OSR)
• Used for loops that are “hot”
• The enclosing method is not yet eligible for JIT compilation, but the loop is
• Flags for logging compilations: -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation
• JITWatch
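A minimal sketch of the OSR situation described above: a method that is called only once but contains a long-running loop. The class name OsrDemo and the loop bound are arbitrary choices for illustration, and whether OSR actually triggers depends on the JVM version and its thresholds.

```java
final class OsrDemo {
    // Invoked a single time, so the invocation counter stays low.
    // The loop's back-edge counter grows quickly, which is what can
    // make the loop itself a candidate for On Stack Replacement.
    static long sumTo(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumTo(100_000_000));
    }
}
```

Running it with -XX:+PrintCompilation should list compilation events as they happen; in that output OSR compilations are normally flagged with a % character. Adding -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation writes the more detailed log that JITWatch reads.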
JIT
• Tiered Compilation
• Level 0: interpreter
• Level 1: C1 with full optimization (no profiling)
• Level 2: C1 with invocation and backedge counters
• Level 3: C1 with full profiling
• Level 4: C2
• Code moves between levels along different paths, depending on how busy the compilers are and on invocation counts
• Code Cache
• Stores compiled code, has a fixed size (240 MB) that can fill up
• Code is unloaded if it has been replaced or if it contained a bad optimization
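One way to watch the code cache from inside a program is the standard java.lang.management API. The sketch below is only an illustration: the class name CodeCacheReport is invented, and filtering pool names on the substring "Code" is an assumption about how HotSpot names these pools, which varies between JDK versions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

final class CodeCacheReport {
    // Collects one line per memory pool whose name mentions "Code".
    // Assumption: on HotSpot these pools back the code cache.
    static List<String> codePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (!pool.getName().contains("Code")) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage != null) { // getUsage() may return null for an invalid pool
                lines.add(pool.getName()
                        + ": used=" + usage.getUsed()
                        + " max=" + usage.getMax());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        codePools().forEach(System.out::println);
    }
}
```

The overall limit is controlled by the -XX:ReservedCodeCacheSize flag, so comparing the reported maximum with that setting is a simple sanity check.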
Chapter 10. Understanding JIT Compilation
JITWatch
• Open Source
• By one of the book’s authors - Chris Newland
• Objective performance measurements
• Analyzes the JIT compilation log
JITWatch screenshots (not reproduced here): Sandbox Tunable Parameters, TriView, Code Cache Layout
Speculative Optimization
• Speculative Optimization
• Using “an unproven assumption about code execution” to optimize
• C1
• Won’t engage in speculative optimizations
• C2
• Uses the profiling data gathered so far to decide how to optimize
• Assumptions are sanity-checked at run time; if one turns out to be wrong, the optimized code is discarded
• Could potentially make things worse
Inlining
• “The Gateway Optimization”
• Replace a method call with the body of the called method
• Removes call overhead
• Allows developer to write cleaner code
• Sometimes won’t inline
• Method is too large
• Call stack too deep
• Not enough space in Code Cache
• All of these are tunable parameters
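To make the idea concrete, here is a small sketch showing a method as a developer would write it and a hand-inlined equivalent that approximates what the compiled code behaves like. The names (InliningDemo, square, sumOfSquares) are invented for the example; the JIT does this transformation on its own, so the second method exists only for comparison.

```java
final class InliningDemo {
    // Small method: the kind of call the JIT would normally consider for inlining.
    static int square(int x) {
        return x * x;
    }

    // As written: one call to square() per iteration.
    static long sumOfSquares(int[] values) {
        long total = 0;
        for (int v : values) {
            total += square(v);
        }
        return total;
    }

    // Hand-inlined equivalent: the call is gone and only its body remains.
    static long sumOfSquaresInlined(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v * v;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3};
        System.out.println(sumOfSquares(values) == sumOfSquaresInlined(values));
    }
}
```

HotSpot flags such as -XX:MaxInlineSize, -XX:FreqInlineSize and -XX:MaxInlineLevel are among the tunables that govern these decisions; their defaults differ between JDK versions and platforms.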
Inlining – JITWatch
Inlining
• Some Java built-ins are too large to be inlined
• String.toUpperCase() and String.toLowerCase() are both too large: 439 bytes of bytecode each
• A little shocking
• Some locales change the length of the text when changing case, so the array may need to be resized
• An ASCII-specific version is only 69 bytes
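An ASCII-only conversion can be very small because the output always has the same length as the input. The sketch below is a hypothetical helper written for these notes (AsciiCase and toUpperAscii are invented names); it is not the version from the book and no claim is made about its exact bytecode size.

```java
final class AsciiCase {
    // Upper-cases only the letters a-z and leaves every other char untouched.
    // The result has the same length as the input, so no resizing is needed.
    static String toUpperAscii(String input) {
        char[] chars = input.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c >= 'a' && c <= 'z') {
                chars[i] = (char) (c - ('a' - 'A'));
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(toUpperAscii("hello, World 1"));
    }
}
```

The trade-off is correctness outside ASCII: the general method must handle cases such as the German ß, which upper-cases to the two characters SS, while this helper simply leaves such characters alone.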
Loop Unrolling
• Happens after Inlining, so the true cost of each method is known
• Each backward jump to the top of the loop is expensive
• The shorter the loop body, the higher the relative cost
• Unroll short loops into a straight series of instructions instead
• Type of the loop counter matters!
• A loop with a long counter (vs int) will not be unrolled
• A loop against a variable MAX cannot be unrolled
• And a safepoint check is added on each iteration
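The sketch below puts the loop shapes side by side: an int-counter loop, the same loop with a long counter, and a hand-unrolled version that illustrates what unrolling means. All names are invented for the example, the unroll factor of four is arbitrary, and the JIT picks its own factor; the hand-written version is only a picture of the transformation.

```java
final class UnrollDemo {
    // Counted loop with an int counter: the shape described as eligible for unrolling.
    static long sumInt(int[] data) {
        long total = 0;
        for (int i = 0; i < data.length; i++) {
            total += data[i];
        }
        return total;
    }

    // Same work with a long counter: per the notes, this form is not unrolled.
    static long sumLong(int[] data) {
        long total = 0;
        for (long i = 0; i < data.length; i++) {
            total += data[(int) i];
        }
        return total;
    }

    // Hand-written illustration of unrolling by four: one backward jump per four elements.
    static long sumUnrolledByHand(int[] data) {
        long total = 0;
        int i = 0;
        int limit = data.length - 3;
        for (; i < limit; i += 4) {
            total += data[i];
            total += data[i + 1];
            total += data[i + 2];
            total += data[i + 3];
        }
        for (; i < data.length; i++) { // leftover elements
            total += data[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(sumInt(data) + " " + sumLong(data) + " " + sumUnrolledByHand(data));
    }
}
```

All three methods compute the same result; only the loop structure differs, which is what a benchmark or the JITWatch view would be comparing.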
Escape Analysis
• Test to see if an object escapes a method
• Returned, stored in a global, etc.
• Happens after Inlining
• Can remove heap allocations
• Scalar replacement: the object’s fields are effectively treated as primitives
• Values are stored in registers
• Or “spilled” to the stack if there are not enough registers
• Only for smaller objects, arrays < 64 elements
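A small sketch of the escaping versus non-escaping distinction. The class names (EscapeDemo, Point) and methods are invented for illustration, and whether the allocation is really removed depends on the JIT having compiled and inlined the code.

```java
final class EscapeDemo {
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    static Point lastSeen;

    // The Point is created, read, and dropped inside this method.
    // It is never returned or stored anywhere, so it does not escape
    // and is a candidate for scalar replacement (x and y kept as plain ints).
    static int manhattan(int x, int y) {
        Point p = new Point(x, y);
        return Math.abs(p.x) + Math.abs(p.y);
    }

    // Here the Point escapes twice: it is stored in a static field and returned,
    // so it has to be a real heap object.
    static Point remember(int x, int y) {
        Point p = new Point(x, y);
        lastSeen = p;
        return p;
    }

    public static void main(String[] args) {
        System.out.println(manhattan(-3, 4));
        System.out.println(remember(1, 2) == lastSeen);
    }
}
```

Escape analysis is enabled by default in HotSpot; running with -XX:-DoEscapeAnalysis and comparing allocation rates is one way to see its effect.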
Locks
• Escape Analysis can be used to remove locks
• An object that never leaves its scope cannot be shared with another thread, so it doesn’t need a lock
• Adjacent lock regions can be merged into a larger one so only 1 lock is needed
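The two lock optimizations above can be pictured with the following sketch. The class and method names are invented, and the comments describe what the JIT is permitted to do, not what it is guaranteed to do on any given run.

```java
final class LockDemo {
    private static final Object LOCK = new Object();
    private static int counter;

    // StringBuffer's methods are synchronized. This buffer never escapes the
    // method, so no other thread can ever contend for it and the JIT may
    // remove the locking altogether.
    static String joinLocal(String a, String b) {
        StringBuffer sb = new StringBuffer();
        sb.append(a);
        sb.append(b);
        return sb.toString();
    }

    // Two back-to-back synchronized blocks on the same monitor: a candidate
    // for being merged into a single, larger lock region.
    static int bumpTwice() {
        synchronized (LOCK) {
            counter++;
        }
        synchronized (LOCK) {
            counter++;
            return counter;
        }
    }

    public static void main(String[] args) {
        System.out.println(joinLocal("a", "b"));
        System.out.println(bumpTwice());
    }
}
```

In both cases the observable behaviour is unchanged; only the number of monitor operations executed at run time goes down.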
Monomorphic Dispatch
• If a method is called repeatedly at the same call site, the receiver is most of the time the same type of object
• Can cache the call target rather than having to look it up in the vtable each time
• If getDate() returns a subclass of Date that overrides the method, the cached call has to change
• Must still sanity-check the type, but doesn’t have to do the full lookup
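The type check plus cached target can be pictured in plain Java as below. This is a hand-written analogy with invented types (Shape, Square, Rect), not what HotSpot literally generates: in the real JVM a failed check typically leads to deoptimization rather than a simple else branch.

```java
final class DispatchDemo {
    interface Shape {
        double area();
    }

    static final class Square implements Shape {
        final double side;

        Square(double side) {
            this.side = side;
        }

        @Override
        public double area() {
            return side * side;
        }
    }

    static final class Rect implements Shape {
        final double w;
        final double h;

        Rect(double w, double h) {
            this.w = w;
            this.h = h;
        }

        @Override
        public double area() {
            return w * h;
        }
    }

    // Ordinary virtual call: the target is looked up for every element.
    static double total(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();
        }
        return total;
    }

    // Picture of a call site that has so far only seen Square.
    static double totalGuarded(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            if (s.getClass() == Square.class) { // cheap sanity check on the type
                Square sq = (Square) s;
                total += sq.side * sq.side;     // cached target, inlined
            } else {
                total += s.area();              // unexpected type: full lookup
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Shape[] shapes = {new Square(2), new Rect(2, 3), new Square(3)};
        System.out.println(total(shapes) == totalGuarded(shapes));
    }
}
```

The guarded version pays for one class comparison per call but, when the guess holds, skips the lookup and exposes the method body to further optimization.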
Intrinsics
• CPU-specific optimizations
• Ex: java.lang.System.arraycopy()
• Accelerated using vector support on the CPU
• Ex: Some CPUs support advanced math functions as instructions
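From the caller's side an intrinsic looks like any other method call, as the sketch below tries to show. The class and method names are invented; System.arraycopy and Math.sqrt are used because HotSpot is known to treat them as intrinsics on common platforms, but what the JIT actually emits depends on the CPU and JVM version.

```java
import java.util.Arrays;

final class IntrinsicDemo {
    // Plain Java loop: the JIT has to optimize this like any other code.
    static int[] copyWithLoop(int[] src) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
        return dst;
    }

    // Same result via System.arraycopy, which the JVM can replace with a
    // hand-tuned, CPU-specific copy routine.
    static int[] copyWithArraycopy(int[] src) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    // Math.sqrt can map to a single hardware square-root instruction
    // on CPUs that provide one.
    static double hypotenuse(double a, double b) {
        return Math.sqrt(a * a + b * b);
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4};
        System.out.println(Arrays.equals(copyWithLoop(src), copyWithArraycopy(src)));
        System.out.println(hypotenuse(3, 4));
    }
}
```

The practical takeaway is to prefer the library call over a hand-rolled loop where one exists, since the library version is the one the JVM knows how to accelerate.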
