Performance optimization techniques for Java code
Who am I and why should you trust me?
●   Attila-Mihály Balázs
    http://hype-free.blogspot.com/
●   Former malware researcher (”low-level
    guy”)
●   Current Java dev (”high level dude”)
●   Spent the last ~6 months optimizing a large
    (1 000 000+ LOC) legacy system
●   Will spend the next 6 months on it too (at
    least)
?
Question everything!
What's this about
●   Core principles
●   Demo 1: collections framework
●   Demo 2, 3, 4: synchronization performance
●   Demo 5: ugly code, is it worth it?
●   Demo 6, 7, 8: playing with Strings
●   Conclusions
●   Q&A
What this is not about
●   Selecting efficient algorithms
●   High level optimizations (architectural
    changes)

●   These are important too! (but require more
    effort, and we are going for the quick win
    here)
Core principles
●   Performance is a balance, an endless
    game of shifting bottlenecks; no silver
    bullets here!

[Diagram: your program in the center, surrounded by CPU, memory, disk, and network]
Perform on all levels!
●   Performance has many levels:
        –   Compiler (JIT): Java 5 to Java 6: 100% (1)
        –   Memory: L1/L2 cache, main memory
        –   Disk: cache, RAID, SSD
        –   Network: 10Mbit, 100Mbit, 1000Mbit
●   Until recently we had it easy (performance
    doubled every 18 months)
●   Now we need to do some work
(1) http://java.sun.com/performance/reference/whitepapers/6_performance.html
Core principles
●   Measure, measure, measure! (before,
    during, after).
●   Try using realistic data!
●   Watch out for the Heisenberg effect (more
    on this later)
●   Some things are not intuitive:
        –   Pop-question: if processing 1000
             messages takes 1 second, how long
             does the processing of 1 message take?
Core principles
●   Throughput
●   Latency
●   Thread context, context switching
●   Lock contention
●   Queueing theory
●   Profiling
●   Sampling
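The pop question about 1000 messages per second has no single answer because throughput and latency are different quantities. One way to relate them is Little's law from queueing theory: the average number of items in flight equals throughput times average latency. The sketch below only illustrates that relationship; the class and method names are made up for this example.

```java
// Little's law sketch: itemsInFlight = throughput * averageLatency.
// Class and method names are hypothetical, chosen for this illustration.
final class LittlesLaw {

    /**
     * Average latency in seconds, given the observed throughput
     * (items per second) and the average number of items being
     * processed at the same time.
     */
    static double averageLatencySeconds(double throughputPerSecond, double itemsInFlight) {
        if (throughputPerSecond <= 0 || itemsInFlight <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        return itemsInFlight / throughputPerSecond;
    }

    public static void main(String[] args) {
        // Same throughput (1000 messages/second), very different latencies.
        for (int inFlight : new int[] {1, 10, 100}) {
            double millis = averageLatencySeconds(1000, inFlight) * 1000;
            System.out.println(inFlight + " in flight: " + millis + " ms per message on average");
        }
    }
}
```

So at 1000 messages per second, one message takes about 1 ms only if messages are processed strictly one at a time; with 100 messages in flight the average is about 100 ms each.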
Feasibility – ”numbers everyone should know” (2)
●   L1 cache reference 0.5 ns
●   Branch mispredict 5 ns
●   L2 cache reference 7 ns
●   Mutex lock/unlock 100 ns
●   Main memory reference 100 ns
●   Compress 1K bytes with Zippy 10,000 ns
●   Send 2K bytes over 1 Gbps network 20,000 ns
●   Read 1 MB sequentially from memory 250,000 ns
●   Round trip within same datacenter 500,000 ns
●   Disk seek 10,000,000 ns
●   Read 1 MB sequentially from network 10,000,000 ns
●   Read 1 MB sequentially from disk 30,000,000 ns
●   Send packet CA->Netherlands->CA 150,000,000 ns
 (2) http://research.google.com/people/jeff/stanford-295-talk.pdf
Feasibility
●   Amdahl's law: The speedup of a program
    using multiple processors in parallel
    computing is limited by the time needed for
    the sequential fraction of the program.
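Amdahl's law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of processors. A minimal sketch of that formula, with a hypothetical class name, shows how quickly the sequential part starts to dominate:

```java
// Amdahl's law sketch: speedup = 1 / ((1 - p) + p / n).
// The class name is hypothetical; only the formula comes from the law itself.
final class Amdahl {

    /**
     * @param parallelFraction fraction of the run time that can be parallelized, in [0, 1]
     * @param processors       number of processors, at least 1
     * @return the theoretical upper bound on the speedup
     */
    static double speedup(double parallelFraction, int processors) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0) {
            throw new IllegalArgumentException("parallelFraction must be in [0, 1]");
        }
        if (processors < 1) {
            throw new IllegalArgumentException("processors must be at least 1");
        }
        double sequentialFraction = 1.0 - parallelFraction;
        return 1.0 / (sequentialFraction + parallelFraction / processors);
    }
}
```

With 90% of the work parallelizable, 4 processors give at most about a 3.08x speedup, and no number of processors can push it past 10x.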
Course of action
●   Have a clear (written?), measurable goal:
    operation X should take less than 100ms in
    99.9% of the cases
●   Measure (profile)
●   Is the goal met? → The End
●   Optimize hotspots → go to step 2
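A goal such as "less than 100ms in 99.9% of the cases" can be checked mechanically once latencies have been recorded. The sketch below assumes the samples are already collected as milliseconds in an array and uses the simple nearest-rank percentile definition; the class and method names are hypothetical, and a profiler or metrics library may define percentiles slightly differently.

```java
import java.util.Arrays;

// Sketch of checking a percentile latency goal. Names are hypothetical.
final class LatencyGoal {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * `percentile` percent of all samples are less than or equal to it.
     * Floating-point rounding can shift the rank by one at exact boundaries.
     */
    static long percentile(long[] samplesMillis, double percentile) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percentile <= 0.0 || percentile > 100.0) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // do not reorder the caller's data
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True if the given percentile of the samples is below the limit. */
    static boolean goalMet(long[] samplesMillis, double percentile, long limitMillis) {
        return percentile(samplesMillis, percentile) < limitMillis;
    }
}
```

Note that an average would hide exactly the outliers this kind of goal is meant to catch, which is why the check looks at a high percentile instead.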
Tools
●   VisualVM
●   JProfiler
●   YourKit

●   Eclipse TPTP
●   Netbeans Profiler
Demo 1: collections framework
●   Name 3 things wrong with this code:


Vector<String> v1;
…
if (!v1.contains(s)) { v1.add(s); }
Demo 1: collections framework
●   Wrong data structure (list / array instead of
    set), hence slooow performance for large
    data sets (but not for small ones!)
●   Extra synchronization if used by a single
    thread only
●   Not actually thread safe! (only ”exception
    safe”)
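One possible fix for the snippet, sketched under the assumption that insertion order matters (otherwise a plain HashSet would do): a Set performs the "contains, then add" step as a single add call. For the single-threaded case no synchronization is needed; for the shared case, a synchronized wrapper makes that single call atomic, which the separate contains and add calls on a Vector never were. The class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of replacing "if (!v1.contains(s)) v1.add(s)" with a Set.
// Class and method names are made up for this example.
final class UniqueStrings {

    /** For use by a single thread: no locking overhead at all. */
    static Set<String> newSingleThreadedSet() {
        return new LinkedHashSet<String>();
    }

    /**
     * For use by several threads: every method call, including add,
     * runs while holding the wrapper's lock, so check-and-insert is atomic.
     */
    static Set<String> newSharedSet() {
        return Collections.synchronizedSet(new LinkedHashSet<String>());
    }

    public static void main(String[] args) {
        Set<String> seen = newSingleThreadedSet();
        // add returns false when the element was already present.
        System.out.println(seen.add("a"));
        System.out.println(seen.add("a"));
        System.out.println(seen.size());
    }
}
```

Iterating over the synchronized wrapper still requires holding its lock manually, so the "thread safety is hard" warning continues to apply.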
Demo 1: lessons
●   Use existing classes
●   Use realistic sample data
●   Thread safety is hard!
●   Heisenberg (observer) effect
Demo 2, 3, 4: synchronization performance
●   If I have N units of work and use 4 threads, it
    must be faster than using a single thread, right?
●   What does lock contention look like?
●   What does a ”synchronization train(wreck)”
    look like?
Demo 2, 3, 4: lessons
●   Use existing classes
        –   ReadWriteLock
        –   java.util.concurrent.*
●   Use realistic sample data (too short / too
    long units of work)
●   Sometimes throwing a threadpool at it
    makes it worse!
●   Consider using a private copy of the
    variable for each thread
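The "private copy per thread" advice can be sketched as follows: instead of every thread updating one shared, locked counter, each task accumulates into a local variable and the partial results are merged once at the end. This is only an illustration with made-up names and a made-up workload (counting even numbers), not the code from the demos.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: per-thread partial results instead of a contended shared counter.
final class PerThreadCounter {

    /** Counts the even numbers in data using the given number of threads. */
    static long countEvens(final int[] data, int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be at least 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.length + threads - 1) / threads; // ceiling division
            List<Future<Long>> partials = new ArrayList<Future<Long>>();
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(t * chunk, data.length);
                final int to = Math.min(from + chunk, data.length);
                Callable<Long> task = () -> {
                    long local = 0; // private to this task: no lock, no contention
                    for (int i = from; i < to; i++) {
                        if (data[i] % 2 == 0) {
                            local++;
                        }
                    }
                    return local;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // merge once per task, not once per element
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this beats a single thread still depends on how large each unit of work is, so it needs to be measured with realistic data.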
Demo 5: ugly code, is it worth it?
 ●   Parsing a logfile
Demo 5: lessons
●   Sometimes yes, but always profile first!
Demo 6: String.substring
●   How are strings stored in Java?
Demo 6: Lesson
●   You can look inside the JRE when needed!
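As background for this demo: in JDKs before Java 7 update 6, String.substring returned a String that shared the backing char array of the original, so a short substring could keep a very large string alive. Later JDKs copy the characters instead. A minimal sketch of the defensive copy commonly used on those older JDKs follows; the class and method names are hypothetical, and on newer JDKs the extra copy is simply redundant.

```java
// Sketch of detaching a substring from its (possibly huge) source string.
// Relevant for JDKs before 7u6, where substring shared the parent's char[].
final class SubstringCopy {

    /**
     * Returns the first `length` characters of source (or all of it if
     * it is shorter) as a String with its own backing storage.
     */
    static String detachedPrefix(String source, int length) {
        String prefix = source.substring(0, Math.min(length, source.length()));
        return new String(prefix); // explicit copy: does not pin the source's storage
    }
}
```

For example, keeping only the timestamp of each line of a large log file would use detachedPrefix(line, 10) rather than line.substring(0, 10).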
Demo 7: repetitive strings
Demo 7: Lessons
●   You shouldn't use String.intern:
        –   Slow
        –   You have to use it everywhere
        –   Needs hand-tuning
●   Use a WeakHashMap for caching (don't
    forget to synchronize!)
●   Use String.equals (not ==)
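A rough sketch of the WeakHashMap-based cache mentioned above, with hypothetical class and method names. Two details are easy to get wrong: the value must be held through a WeakReference, because a strongly referenced value that is the key itself would prevent the entry from ever being collected, and access must be synchronized because WeakHashMap is not thread safe.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of a canonicalizing string cache as an alternative to String.intern.
final class StringCache {

    // Key: the string; value: a weak reference to the canonical instance.
    private final Map<String, WeakReference<String>> cache =
            new WeakHashMap<String, WeakReference<String>>();

    /** Returns one shared instance for all strings that are equal to s. */
    synchronized String canonical(String s) {
        if (s == null) {
            return null;
        }
        WeakReference<String> ref = cache.get(s);
        String cached = (ref == null) ? null : ref.get();
        if (cached != null) {
            return cached;
        }
        cache.put(s, new WeakReference<String>(s));
        return s;
    }
}
```

Callers should still compare the results with equals rather than ==, since nothing guarantees that every string in the program went through the cache.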
Demo 8: charsets
–   ASCII
–   ISO-8859-1
–   UTF-8
–   UTF-16
Demo 8: lessons
●   Use UTF-8 where possible
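To see how much space the charsets listed above need for the same text, the encoded length can simply be measured. The sketch below uses a hypothetical helper class; UTF-16BE is used instead of plain UTF-16 so that the byte order mark Java adds when encoding does not distort the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: compare the encoded size of the same text in different charsets.
final class EncodedSizes {

    /** Number of bytes needed to store text in the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // "hello" with an e-acute as second letter
        Charset[] charsets = {
            StandardCharsets.US_ASCII, // cannot represent the accented letter
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE
        };
        for (Charset charset : charsets) {
            System.out.println(charset.name() + ": " + encodedLength(text, charset) + " bytes");
        }
    }
}
```

For this five-character text, ISO-8859-1 needs 5 bytes, UTF-8 needs 6 and UTF-16BE needs 10. US-ASCII also produces 5 bytes, but only because getBytes silently replaces the unmappable character with '?', which is data loss rather than a saving.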
Conclusions
●   Measure twice, cut once
●   Don't trust advice you didn't test! (including
    mine)
●   Most of the time you don't need to sacrifice
    clean code for performant code
Conclusions
●   Slides:
        –   Google Groups
        –   http://hype-free.blogspot.com/
        –   x_at_y_or_z@yahoo.com
●   Source code:
        –   http://code.google.com/p/hype-free/source/browse/#svn/trunk/java-perfopt-201003
●   Profiler evaluation licenses
Resources
●   https://visualvm.dev.java.net/
●   http://www.ej-technologies.com/
●   http://blog.ej-technologies.com/
●   http://www.yourkit.com/
●   http://www.yourkit.com/docs/index.jsp
●   http://www.yourkit.com/eap/index.jsp
Thank you!

Questions?
