JAREKJAREK
Started on C64 (Basic, 6502), Amiga (68k, C), C++, Java, JS, Scala, Kotlin...
MOTIVATIONMOTIVATION
Mythical slow solution
IS THAT ALL TRUE?IS THAT ALL TRUE?
long time ago
tmp = $fb
Start:
ldx #10; !! decimal
clc
ldy #$0
lda #$1
fib: dex
beq end
sty tmp
tay
adc tmp
jmp fib
end:
tay
long time ago
branch not taken 2 cycles
tmp = $fb
Start:
ldx #10; !! decimal
clc
ldy #$0
lda #$1
fib: dex
beq end
sty tmp
tay
adc tmp
jmp fib
end:
tay
long time ago
branch not taken 2 cycles
branch taken 3(!!!) cycles
tmp = $fb
Start:
ldx #10; !! decimal
clc
ldy #$0
lda #$1
fib: dex
beq end
sty tmp
tay
adc tmp
jmp fib
end:
tay
Rewrite if statement to save a cycle
motorola 68040
branch taken �xed prediction strategy
PERFORMANCE MYTHSPERFORMANCE MYTHS
not precise,
valid only in certain contexts,
valid in the past,
not relevant
COST OF PERFORMANCE MYTHSCOST OF PERFORMANCE MYTHS
Code:
less readable
less �exible
less abstract
Lost focus.
PERFORMANCE IS A GOOD MARKETING TOOLPERFORMANCE IS A GOOD MARKETING TOOL
Marketing my shiny new library
SCALASCALA
Scala combines object-oriented and functional programming in one
concise, high-level language.
Scala is far from metal(s)
ARE WE IMMUNE?ARE WE IMMUNE?
SOME ARTICLESSOME ARTICLES
applicative-functors-vs-monad-composing-performance-in-scala
No More Transformers: High-Performance E�ects in Scalaz 8
ON PERFORMANCE OF THE NEW COLLECTIONS
Scala’s immutable collections can be slow as a snail
...
SCALA DOES NOT MAKE IT EASYSCALA DOES NOT MAKE IT EASY
A given piece of code may and will perform di�erently when
called/used in a di�erent context
WE WILL GO DEEPWE WILL GO DEEP
TOOLSTOOLS
All the benchmarks and numbers in this talk are presented only to
show the tools
DO NOT make any decisions based on those numbers
LET'S TALK ABOUT IMMUTABILITYLET'S TALK ABOUT IMMUTABILITY
object ImmutableCalc {
case class Calculator(total : Int = 0) {
def add(i: Int) :Calculator = this.copy(total = total + i)
def get() : Int = total
}
}
object MutableCalc {
class Calculator {
var total = 0
def add(i: Int) : Unit = total+=i
def get() : Int = total
}
}
INTO THE BYTECODEINTO THE BYTECODE
ls
'ImmutableCalc$Calculator.class'
'ImmutableCalc$Calculator$.class'
ImmutableCalc.class
'ImmutableCalc$.class'k
and ~ 20 other methods
//CODE decompiled using decompile Scala to Java
public final class ImmutableCalc {
public static class Calculator implements Product, Serializable {
private final int total;
public Iterator productElementNames() {
return Product.productElementNames$(this);
}
public int total() {
return this.total;
}
public ImmutableCalc.Calculator add(final int i) {
return this.copy(this.total() + i);
}
public int get() {
return this.total();
}
public ImmutableCalc.Calculator copy(final int total) {
If you do not like IDE...
javap -p -c 'ImmutableCalc$Calculator.class'
public calculator.immuable.ImmutableCalc$Calculator add(int);
Code:
0: aload_0
1: aload_0
2: invokevirtual #31 // Method total:()I
5: iload_1
6: iadd
7: invokevirtual #34 // Method copy:(I)Lcalculator/immuable/ImmutableCalc$Calcula
10: areturn
public calculator.immuable.ImmutableCalc$Calculator add(int);
Code:
0: aload_0 ; stack = [this]
1: aload_0 ; stack = [this,this]
2: invokevirtual #31 ; total() ; stack = [this,this.total]
5: iload_1 ; stack = [this, this.total, i]
6: iadd ; stack = [this, total + i]
7: invokevirtual #34 ; copy(...); stack = [newObject]
10: areturn ;return newObject
https://en.wikipedia.org/wiki/Java_bytecode_instruction_listings
INFOINFO
There is also lot of metadata in .class �les (not only bytecode)
No one needs to read bytecode - unless You work on a compiler
Run it
object CalcRun {
def main(args: Array[String]): Unit = {
var sum = 0L
for ( i <-0 to 100000) {
val calc = ImmutableCalc.Calculator()
sum+=(1 to 10000).foldLeft( calc)((c,i) => c.add(i)).get()
}
println(sum)
}
}
JITJIT
JITJIT
-XX:+PrintCompilation
0,1,2,3,4 - compilation Tier - higher ~ more optimized
% - on stack replacement
made not entrant - do not use this
! - exception handler
s - synchronized
298 485 2 runners.CalcRun$::$anonfun$main$2 (6 bytes)
298 486 2 calculator.immuable.ImmutableCalc$Calculator::add (11 bytes)
298 487 2 calculator.immuable.ImmutableCalc$Calculator::copy (9 bytes)
298 472 1 java.lang.Integer::intValue (5 bytes)
298 473 1 scala.collection.immutable.RangeIterator::hasNext (5 bytes)
298 475 2 java.lang.Integer::<init> (10 bytes)
298 474 1 calculator.immuable.ImmutableCalc$Calculator::total (5 bytes)
298 476 2 java.lang.Integer::valueOf (32 bytes)
...
335 501 % 4 scala.collection.IterableOnceOps::foldLeft @ 13 (43 bytes)
341 499 % 3 scala.collection.IterableOnceOps::foldLeft @ -2 (43 bytes) made not entrant
0,1,2,3,4 - compilation Tier - higher ~ more optimized
% - on stack replacement
made not entrant - do not use this
! - exception handler
s - synchronized
298 485 2 runners.CalcRun$::$anonfun$main$2 (6 bytes)
298 486 2 calculator.immuable.ImmutableCalc$Calculator::add (11 bytes)
298 487 2 calculator.immuable.ImmutableCalc$Calculator::copy (9 bytes)
298 472 1 java.lang.Integer::intValue (5 bytes)
298 473 1 scala.collection.immutable.RangeIterator::hasNext (5 bytes)
298 475 2 java.lang.Integer::<init> (10 bytes)
298 474 1 calculator.immuable.ImmutableCalc$Calculator::total (5 bytes)
298 476 2 java.lang.Integer::valueOf (32 bytes)
...
335 501 % 4 scala.collection.IterableOnceOps::foldLeft @ 13 (43 bytes)
341 499 % 3 scala.collection.IterableOnceOps::foldLeft @ -2 (43 bytes) made not entrant
https://gist.github.com/rednaxelafx/1165804#�le_notes.md
-XX:+PrintInlining
1275 579 4 runners.CalcRun$$$Lambda$3/1967205423::apply$mcVI$sp (9 bytes)
@ 5 runners.CalcRun$::$anonfun$main$1 (56 bytes) inline (hot)
@ 7 calculator.immuable.ImmutableCalc$Calculator$::apply$default
@ 10 calculator.immuable.ImmutableCalc$Calculator::<init> (
@ 6 java.lang.Object::<init> (1 bytes) inline (hot)
@ 10 scala.Product::$init$ (1 bytes) inline (hot)
@ 26 scala.LowPriorityImplicits::intWrapper (2 bytes) inline (hot)
@ 32 scala.runtime.RichInt$::to$extension (15 bytes) inline (hot)
@ 11 scala.collection.immutable.Range$Inclusive::<init> (
@ 4 scala.collection.immutable.Range::<init> (215 bytes) inline
@ 16 scala.collection.immutable.AbstractSeq::<init> (
@ 1 scala.collection.AbstractSeq::<init> (5 bytes) inline (h
@ 1 scala.collection.AbstractIterable::<init> (5 bytes) in
@ 1 java.lang.Object::<init> (1 bytes) inline (hot)
@ 78 scala.collection.immutable.Range::isEmpty (5 bytes) acces
@ 89 scala.collection.immutable.Range::longLength (25 bytes) i
@ 1 scala.collection.immutable.Range::gap (12 bytes) inline
@ 1 scala.collection.immutable.Range::end (5 bytes) access
@ 6 scala.collection.immutable.Range::start (5 bytes) acce
@ 5 scala.collection.immutable.Range::step (5 bytes) accesso
@ 11 scala.collection.immutable.Range::hasStub (18 bytes) in
@ 1 scala.collection.immutable.Range$Inclusive::isInclusive
WHY BOTHER WITH PRINTCOMPILATION,WHY BOTHER WITH PRINTCOMPILATION,
INLININGINLINING
1. check if critical code was in fact JITed
2. sometimes problems with bytecode can be seen (method never
compiled)
3. look for uncommon traps
MEASURINGMEASURING
JMH
jvm warmup
prevent dead code elimination
addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.3.7")
VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+9-jvmci-20.0-
b02
@Benchmark
def measureWrong: Unit =
Math.log(x)
@Benchmark
def measureRight: Double =
Math.log(x)
[info] JMHSample_08_DeadCode.measureRight avgt 3 9.891 ± 13.465 ns/op
[info] JMHSample_08_DeadCode.measureWrong avgt 3 0.383 ± 0.132 ns/op
@Benchmark
def measureWrong: Unit =
Math.log(x) // in fact same as = {}
Let's measure CalcCost
@Benchmark
def imperative: Int = {
val calc = new MutableCalc.Calculator()
(1 to numOps).foreach(calc.add)
calc.get()
}
@Benchmark
def immutable: Int = {
val calc = ImmutableCalc.Calculator()
(1 to numOps).foldLeft( calc)((c,i) => c.add(i)).get()
}
object ImmutableCalc {
case class Calculator(total : Int = 0) {
def add(i: Int) :Calculator = this.copy(total = total + i)
def get() : Int = total
}
}
object MutableCalc {
class Calculator {
var total = 0
def add(i: Int) : Unit = total+=i
def get() : Int = total
}
}
VM version: JDK 1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b08
VM version: JDK 1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b08
[info] Benchmark Mode Cnt Score Error Units
[info] CalcCost.freeMonadScalaZ avgt 5 154761.225 ± 10652.856 ns/op
[info] CalcCost.immutable avgt 5 15393.204 ± 670.565 ns/op
[info] CalcCost.imperative avgt 5 284.435 ± 11.790 ns/op
[info] CalcCost.tfIO avgt 5 123841.734 ± 30096.835 ns/op
VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7
VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7
[info] CalcCost.freeMonadScalaZ avgt 5 130559.311 ± 21298.291 ns/op
[info] CalcCost.immutable avgt 5 3682.014 ± 510.659 ns/op
[info] CalcCost.imperative avgt 5 263.753 ± 8.953 ns/op
[info] CalcCost.tfIO avgt 5 100277.673 ± 4185.455 ns/op
JVM 20.0.0.r11-grl
20.0.0.r11-grl
[info] CalcCost.freeMonadScalaZ avgt 15 147832.984 ± 15674.721 ns/op
[info] CalcCost.immutable avgt 15 601.389 ± 8.249 ns/op
[info] CalcCost.imperative avgt 15 420.588 ± 103.107 ns/op
[info] CalcCost.tfIO avgt 15 121616.405 ± 5716.283 ns/op
INSPECTING THE JVMINSPECTING THE JVM
JavaFlightRecorder - record "events"
Java Mission Control - browse results
The Java Flight Recorder (JFR) is a commercial feature. You can use it
for free on developer desktops/laptops, and for evaluation purposes in
test, development, and production environments. However, to enable
JFR on a production server, you require a commercial license. Using
JMC UI for other purposes on the JDK does not require a commercial
license.
Java Flight Recorder
low overhead,
detailed jvm events,
additional graphs and tools (for jfr) (intellij)
own events, (use for low overhead trace log)
only on some jvms
license - check your jvm version
-XX:+FlightRecorder
-XX:StartFlightRecording=duration=60s,filename=myrecording.j
JVM 8
JVM 8
Graal
Graal
JPSJPS
hint: jps -mlVv (detailed jvm args)
58661 Main
93654 Jps
57464 NailgunRunner
77659 org.eclipse.equinox.launcher_1.3.0.v20140415-2008.jar
80924 Launcher
jstack 91221 (java 8)
"main" #1 prio=5 os_prio=0 tid=0x00007f33b400d800 nid=0x14812 runnable [0x00007f33bdf38000]
java.lang.Thread.State: RUNNABLE
at calculator.immuable.ImmutableCalc$Calculator.<init>(ImmutableCalc.scala:4)
at calculator.immuable.ImmutableCalc$Calculator.copy(ImmutableCalc.scala:4)
at calculator.immuable.ImmutableCalc$Calculator.add(ImmutableCalc.scala:6)
at runners.CalcRun$.$anonfun$main$2(CalcRun.scala:10)
at runners.CalcRun$.$anonfun$main$2$adapted(CalcRun.scala:10)
at runners.CalcRun$$$Lambda$4/801197928.apply(Unknown Source)
at scala.collection.IterableOnceOps.foldLeft(IterableOnce.scala:636)
at scala.collection.IterableOnceOps.foldLeft$(IterableOnce.scala:632)
at scala.collection.AbstractIterable.foldLeft(Iterable.scala:921)
at runners.CalcRun$.$anonfun$main$1(CalcRun.scala:10)
jstack 94038 (graal)
"main" #1 prio=5 os_prio=0 cpu=26791.23ms elapsed=26.89s tid=0x00007fafb0028800 nid=0x13d86 runnable [
java.lang.Thread.State: RUNNABLE
at scala.collection.IterableOnceOps.foldLeft(IterableOnce.scala:635)
at scala.collection.IterableOnceOps.foldLeft$(IterableOnce.scala:632)
at scala.collection.AbstractIterable.foldLeft(Iterable.scala:921)
at runners.CalcRun$.$anonfun$main$1(CalcRun.scala:10)
at runners.CalcRun$$$Lambda$17/0x00000008400d8040.apply$mcVI$sp(Unknown Source)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:190)
at runners.CalcRun$.main(CalcRun.scala:8)
at runners.CalcRun.main(CalcRun.scala)
"Reference Handler" #2 daemon prio=10 os_prio=0 cpu=0.32ms elapsed=26.87s tid=0x00007fafb024c800 nid=
java.lang.Thread.State: RUNNABLE
at java.lang.ref.Reference.waitForReferencePendingList(java.base@11.0.6/Native Method)
at java.lang.ref.Reference.processPendingReferences(java.base@11.0.6/Reference.java:241
at java.lang.ref.Reference$ReferenceHandler.run(java.base@11.0.6/Reference.java:213)
BEWARE OF SAFEPOINTSBEWARE OF SAFEPOINTS
INTO ASSEMBLYINTO ASSEMBLY
We need to go deeper
-XX:+PrintAssembly
[Constants]
# {method} {0x00007f93f0e2fc48} 'add' '(I)Lcalculator/immuable/ImmutableCalc$Calculator;' in
# this: rsi:rsi = 'calculator/immuable/ImmutableCalc$Calculator'
# parm0: rdx = int
# [sp+0x80] (sp of caller)
0x00007f93f1422680: mov 0x8(%rsi),%r10d
0x00007f93f1422684: shl $0x3,%r10
0x00007f93f1422688: cmp %rax,%r10
0x00007f93f142268b: jne 0x00007f93f1045ba0 ; {runtime_call}
0x00007f93f1422691: nopl 0x0(%rax,%rax,1)
0x00007f93f1422699: nopl 0x0(%rax)
[Verified Entry Point]
0x00007f93f14226a0: mov %eax,-0x14000(%rsp)
0x00007f93f14226a7: push %rbp
0x00007f93f14226a8: sub $0x70,%rsp
0x00007f93f14226ac: movabs $0x7f93f0e43db0,%rdi
0x00007f93f14226b6: mov 0x8(%rdi),%eax
0x00007f93f14226b9: add $0x8,%eax
0x00007f93f14226bc: mov %eax,0x8(%rdi)
0x00007f93f14226bf: movabs $0x7f93f0e2fc48,%rdi ; {metadata({method} {0x00007f93f0e2fc48} 'add' '(I
0x00007f93f14226c9: and $0x3ff8,%eax
0x00007f93f14226cf: cmp $0x0,%eax
-XX:PrintAssemblyOptions=intel
0x00007f11f540f880: mov r10d,DWORD PTR [rsi+0x8]
0x00007f11f540f884: shl r10,0x3
0x00007f11f540f888: cmp r10,rax
0x00007f11f540f88b: jne 0x00007f11f5045ba0 ; {runtime_call}
0x00007f11f540f891: nop DWORD PTR [rax+rax*1+0x0]
0x00007f11f540f899: nop DWORD PTR [rax+0x0]
[Verified Entry Point]
0x00007f11f540f8a0: mov DWORD PTR [rsp-0x14000],eax
0x00007f11f540f8a7: push rbp
0x00007f11f540f8a8: sub rsp,0x70
0x00007f11f540f8ac: movabs rdi,0x7f11e3e43db0
0x00007f11f540f8b6: mov eax,DWORD PTR [rdi+0x8]
0x00007f11f540f8b9: add eax,0x8
0x00007f11f540f8bc: mov DWORD PTR [rdi+0x8],eax
0x00007f11f540f8bf: movabs rdi,0x7f11e3e2fc48 ; {metadata({method} {0x00007f11e3e2fc48} 'add' '(I)L
0x00007f11f540f8c9: and eax,0x3ff8
0x00007f11f540f8cf: cmp eax,0x0
0x00007f11f540f8d2: je 0x00007f11f540f92c ;*aload_0
JIT WATCHJIT WATCH
CHECK JSTATCHECK JSTAT
jstat java 8
jstat -gccause 128409 1000 1000
S0 S1 E O M CCS YGC YGCT FGC FGCT CGC CGCT GCT LGCC
0.00 6.25 0.00 1.10 93.80 80.76 59 0.137 0 0.000 - - 0.137
0.00 0.00 56.00 1.10 93.80 80.76 63 0.146 0 0.000 - - 0.146
0.00 6.25 6.00 1.10 93.80 80.76 67 0.155 0 0.000 - - 0.155
6.25 0.00 76.00 1.10 93.80 80.76 70 0.162 0 0.000 - - 0.162
0.00 0.00 78.00 1.10 93.80 80.76 74 0.172 0 0.000 - - 0.172
6.25 0.00 70.00 1.10 93.80 80.76 78 0.181 0 0.000 - - 0.181
6.25 0.00 44.00 1.10 93.80 80.76 82 0.193 0 0.000 - - 0.193
0.00 0.00 22.00 1.10 93.80 80.76 86 0.199 0 0.000 - - 0.199
6.25 0.00 20.00 1.10 93.80 80.76 90 0.205 0 0.000 - - 0.205
0.00 0.00 44.00 1.10 93.80 80.76 95 0.213 0 0.000 - - 0.213
0.00 6.25 58.00 1.10 93.80 80.76 101 0.226 0 0.000 - - 0.226
0.00 0.00 76.00 1.10 93.80 80.76 104 0.239 0 0.000 - - 0.239
0.00 6.25 42.00 1.10 93.80 80.76 107 0.249 0 0.000 - - 0.249
0.00 0.00 30.00 1.10 93.80 80.76 110 0.255 0 0.000 - - 0.255
jstat graal
jstat -gccause 129920 1000 1000
S0 S1 E O M CCS YGC YGCT FGC FGCT CGC CGCT GCT LGCC
0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003
0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003
0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003
0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003
0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003
0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003
0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003
0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003
THERE ARE MORE TOOLSTHERE ARE MORE TOOLS
JPro�ler DynaTrace
IF WE HAVE SOME TIMEIF WE HAVE SOME TIME
LETS TALS ABOUT MONAD TRANSFORMERSLETS TALS ABOUT MONAD TRANSFORMERS
case class User(id: Long, name: String, parentId: Long)
object UserData {
val allUsers = List(
User(1, "Nikt", 8),
User(2, "Ktos", 7),
User(4, "A", 2),
User(7, "B", 4)
)
}
(scalaz used)
object Users {
val allUsers = UserData.allUsers
def findUser(id: Long): OptionT[Task, User] = OptionT.optionT {
Task.delay({
val found: Option[User] = allUsers.find(u => u.id == id)
found
}
)
}
}
val indices: Seq[Int] = (1 to 100).toList
def run = {
val users: Seq[OptionT[Task, User]] = indices.map(id => Users.findUser(id))
val parents = users.map(u => u.flatMap(z => Users.findUser(z.parentId)))
parents.map(u => u.run.unsafePerformSync).count(u => u.isDefined)
}
object StackedMonads {
val indices: Seq[Int] = (1 to 100).toList
def run = {
val users = indices.map ( id => Users.findUser(id))
val parents = users.map ((u: Task[Option[User]]) =>
u.flatMap((ou: Option[User]) => ou match {
case None => Task.delay({None})
case Some(x) => Users.findUser(x.parentId)
}
)
)
parents.map (u=>u.unsafePerformSync).count(u =>u.isDefined)
}
def findUser(id: Long): Task[Option[User]] = {
Task.delay {
val found: Option[User] = allUsers.find(u => u.id == id)
found
}
}
Monad transformer vs simply stacked (jvm 8)
Let's simply use null
object NOMonad {
val indices: Seq[Int] = (1 to 100).toList
def run = {
val users = indices.map(id => Users.findUser(id))
val parents: Seq[User] = users.map(u =>
if (u != null) {
Users.findUser(u.parentId)
} else {
null
}
)
parents.count((u: User) =>u != null)
}
object Users {
val allUsers = UserData.allUsers
def findUser(id: Long): User = {
val found= allUsers.find(u => u.id == id).orNull
found
}
}
wow, nulls are awesome! (jvm 8)
object SimpleDB {
val connection = DB.initializeDB
def run = {
val users = indices.map(id => Users.findUser(id))
val parents: Seq[User] = users.map(u =>
if (u != null) {
Users.findUser(u.parentId)
} else {
null
}
)
parents.count((u: User) => u != null)
}
object Users {
val userQuery = connection.prepareStatement("SELECT * FROM USERS WHERE id= ?")
def findUser(id: Long): User = {
userQuery.clearParameters()
userQuery.setLong(1, id)
val res = userQuery.executeQuery()
Time to DROP DATABASE
Monad Transformer combined with DB
object OptimizedDB {
val connection = DB.initializeDB
def run = {
val count = Users.countParents()
count
}
object Users {
val userQuery = connection.prepareStatement("SELECT " +
"count(*) FROM USERS u1, USERS u2 WHERE u1.id>1 and u1.id < 100 AND u1.parentId = u2.id"
def countParents(): Int = {
userQuery.clearParameters()
val res = userQuery.executeQuery()
try {
if (res.next()) {
res.getInt(1)
} else {
throw new IllegalStateException("no count from DB")
}
Optimized DB (java 8)
Same on graal
jvm 8
[info] MT.monadTR avgt 3 16210.589 ± 14114.784 ns/op
[info] MT.monadTRDB avgt 3 112424.770 ± 109957.046 ns/op
[info] MT.noMonad avgt 3 1910.297 ± 64.699 ns/op
[info] MT.optimizedDB avgt 3 685.030 ± 99.553 ns/op
[info] MT.simpleDB avgt 3 103198.262 ± 36190.463 ns/op
[info] MT.stackedMonads avgt 3 13040.031 ± 733.337 ns/op
jvm 14
[info] MT.monadTR avgt 3 13328.252 ± 9569.020 ns/op
[info] MT.monadTRDB avgt 3 123824.784 ± 148322.320 ns/op
[info] MT.noMonad avgt 3 1934.838 ± 1596.319 ns/op
[info] MT.optimizedDB avgt 3 767.592 ± 49.940 ns/op
[info] MT.simpleDB avgt 3 96992.954 ± 828.014 ns/op
[info] MT.stackedMonads avgt 3 15777.281 ± 7370.754 ns/op
graal 20
[info] MT.monadTR avgt 3 11057.304 ± 4885.176 ns/op
[info] MT.monadTRDB avgt 3 119223.204 ± 109133.489 ns/op
[info] MT.noMonad avgt 3 1529.025 ± 71.307 ns/op
[info] MT.optimizedDB avgt 3 905.659 ± 34.349 ns/op
[info] MT.simpleDB avgt 3 107627.684 ± 6618.677 ns/op
[info] MT.stackedMonads avgt 3 8534.377 ± 739.795 ns/op
DEEPER THAN ASSEMBLYDEEPER THAN ASSEMBLY
Just looping
simple for
val array = Array.ofDim[Long](sz + 1, sz + 1)
@Benchmark
def forBasedLoop: Long = {
var sum: Long = 0
for (i <- 0 to sz) {
for (j <- 0 to sz) {
sum += array(i)(j)
}
}
sum
}
while
@Benchmark
def whileLoopA: Long = {
var sum: Long = 0
var i = 0
while (i < sz) {
var j = 0
while (j < sz) {
sum += array(i)(j)
j = j + 1
}
i = i + 1
}
sum
}
while reversed (and transposed)
@Benchmark
def whileRevLoop: Long = {
var sum: Long = 0
var j = sz
while (j >= 0) {
var i = sz
while (i >= 0) {
sum += array(i)(j)
i = i - 1
}
j = j - 1
}
sum
}
PERFORMANCE COUNTERSPERFORMANCE COUNTERS
jmh:run -prof perf -w 5s -r 5s -wi 3 -i 3 -f 3
.*ArrayLooping.*
[info] 14,650.91 msec task-clock # 0.482 CPUs utilized
[info] 525 context-switches # 0.036 K/sec
[info] 65 cpu-migrations # 0.004 K/sec
[info] 185 page-faults # 0.013 K/sec
[info] 30,370,384,927 cycles # 2.073 GHz (
[info] 285,624,461 stalled-cycles-frontend # 0.94% frontend cycles idle (
[info] 19,572,932,807 stalled-cycles-backend # 64.45% backend cycles idle (
[info] 35,907,154,673 instructions # 1.18 insn per cycle
[info] # 0.55 stalled cycles per insn (
[info] 6,706,376,215 branches # 457.745 M/sec (
[info] 12,810,882 branch-misses # 0.19% of all branches (
[info] 16,163,472,375 L1-dcache-loads # 1103.240 M/sec (
[info] 10,975,494,751 L1-dcache-load-misses # 67.90% of all L1-dcache hits (
[info] <not supported> LLC-loads
[info] <not supported> LLC-load-misses
[info] 175,788,792 L1-icache-loads # 11.998 M/sec (
[info] 7,051,118 L1-icache-load-misses # 4.01% of all L1-icache hits (
[info] 5,302,762,430 dTLB-loads # 361.941 M/sec (
[info] 14,779,238 dTLB-load-misses # 0.28% of all dTLB cache hits (
[info] 245,713 iTLB-loads # 0.017 M/sec (
[info] 1,195,242 iTLB-load-misses # 486.44% of all iTLB cache hits (
[info] 289,031,594 L1-dcache-prefetches # 19.728 M/sec (
while T+R
[info] 14,650.91 msec task-clock # 0.482 CPUs utilized
[info] 525 context-switches # 0.036 K/sec
[info] 65 cpu-migrations # 0.004 K/sec
[info] 185 page-faults # 0.013 K/sec
[info] 30,370,384,927 cycles # 2.073 GHz (
[info] 285,624,461 stalled-cycles-frontend # 0.94% frontend cycles idle (
[info] 19,572,932,807 stalled-cycles-backend # 64.45% backend cycles idle (
[info] 35,907,154,673 instructions # 1.18 insn per cycle
[info] # 0.55 stalled cycles per insn (
[info] 6,706,376,215 branches # 457.745 M/sec (
[info] 12,810,882 branch-misses # 0.19% of all branches (
[info] 16,163,472,375 L1-dcache-loads # 1103.240 M/sec (
[info] 10,975,494,751 L1-dcache-load-misses # 67.90% of all L1-dcache hits (
[info] <not supported> LLC-loads
[info] <not supported> LLC-load-misses
[info] 175,788,792 L1-icache-loads # 11.998 M/sec (
[info] 7,051,118 L1-icache-load-misses # 4.01% of all L1-icache hits (
[info] 5,302,762,430 dTLB-loads # 361.941 M/sec (
[info] 14,779,238 dTLB-load-misses # 0.28% of all dTLB cache hits (
[info] 245,713 iTLB-loads # 0.017 M/sec (
[info] 1,195,242 iTLB-load-misses # 486.44% of all iTLB cache hits (
[info] 289,031,594 L1-dcache-prefetches # 19.728 M/sec (
while T+R
L1
[info] 14,650.91 msec task-clock # 0.482 CPUs utilized
[info] 525 context-switches # 0.036 K/sec
[info] 65 cpu-migrations # 0.004 K/sec
[info] 185 page-faults # 0.013 K/sec
[info] 30,370,384,927 cycles # 2.073 GHz (
[info] 285,624,461 stalled-cycles-frontend # 0.94% frontend cycles idle (
[info] 19,572,932,807 stalled-cycles-backend # 64.45% backend cycles idle (
[info] 35,907,154,673 instructions # 1.18 insn per cycle
[info] # 0.55 stalled cycles per insn (
[info] 6,706,376,215 branches # 457.745 M/sec (
[info] 12,810,882 branch-misses # 0.19% of all branches (
[info] 16,163,472,375 L1-dcache-loads # 1103.240 M/sec (
[info] 10,975,494,751 L1-dcache-load-misses # 67.90% of all L1-dcache hits (
[info] <not supported> LLC-loads
[info] <not supported> LLC-load-misses
[info] 175,788,792 L1-icache-loads # 11.998 M/sec (
[info] 7,051,118 L1-icache-load-misses # 4.01% of all L1-icache hits (
[info] 5,302,762,430 dTLB-loads # 361.941 M/sec (
[info] 14,779,238 dTLB-load-misses # 0.28% of all dTLB cache hits (
[info] 245,713 iTLB-loads # 0.017 M/sec (
[info] 1,195,242 iTLB-load-misses # 486.44% of all iTLB cache hits (
[info] 289,031,594 L1-dcache-prefetches # 19.728 M/sec (
[info] 14,601.12 msec task-clock # 0.481 CPUs utilized
[info] 551 context-switches # 0.038 K/sec
[info] 79 cpu-migrations # 0.005 K/sec
[info] 142 page-faults # 0.010 K/sec
[info] 56,887,530,219 cycles # 3.896 GHz (
[info] 371,071,536 stalled-cycles-frontend # 0.65% frontend cycles idle (
[info] 39,685,104,564 stalled-cycles-backend # 69.76% backend cycles idle (
[info] 85,472,875,315 instructions # 1.50 insn per cycle
[info] # 0.46 stalled cycles per insn (
[info] 7,673,425,909 branches # 525.537 M/sec (
[info] 112,782,902 branch-misses # 1.47% of all branches (
[info] 56,302,951,054 L1-dcache-loads # 3856.070 M/sec (
[info] 7,136,733,184 L1-dcache-load-misses # 12.68% of all L1-dcache hits (
[info] <not supported> LLC-loads
[info] <not supported> LLC-load-misses
[info] 355,107,740 L1-icache-loads # 24.321 M/sec (
[info] 10,933,836 L1-icache-load-misses # 3.08% of all L1-icache hits (
[info] 113,201,590 dTLB-loads # 7.753 M/sec (
[info] 8,907,743 dTLB-load-misses # 7.87% of all dTLB cache hits (
[info] 46,228 iTLB-loads # 0.003 M/sec (
[info] 939,260 iTLB-load-misses # 2031.80% of all iTLB cache hits (
[info] 7,001,888,370 L1-dcache-prefetches # 479.545 M/sec (
normal while
[info] 14,601.12 msec task-clock # 0.481 CPUs utilized
[info] 551 context-switches # 0.038 K/sec
[info] 79 cpu-migrations # 0.005 K/sec
[info] 142 page-faults # 0.010 K/sec
[info] 56,887,530,219 cycles # 3.896 GHz (
[info] 371,071,536 stalled-cycles-frontend # 0.65% frontend cycles idle (
[info] 39,685,104,564 stalled-cycles-backend # 69.76% backend cycles idle (
[info] 85,472,875,315 instructions # 1.50 insn per cycle
[info] # 0.46 stalled cycles per insn (
[info] 7,673,425,909 branches # 525.537 M/sec (
[info] 112,782,902 branch-misses # 1.47% of all branches (
[info] 56,302,951,054 L1-dcache-loads # 3856.070 M/sec (
[info] 7,136,733,184 L1-dcache-load-misses # 12.68% of all L1-dcache hits (
[info] <not supported> LLC-loads
[info] <not supported> LLC-load-misses
[info] 355,107,740 L1-icache-loads # 24.321 M/sec (
[info] 10,933,836 L1-icache-load-misses # 3.08% of all L1-icache hits (
[info] 113,201,590 dTLB-loads # 7.753 M/sec (
[info] 8,907,743 dTLB-load-misses # 7.87% of all dTLB cache hits (
[info] 46,228 iTLB-loads # 0.003 M/sec (
[info] 939,260 iTLB-load-misses # 2031.80% of all iTLB cache hits (
[info] 7,001,888,370 L1-dcache-prefetches # 479.545 M/sec (
normal while
L1
[info] 14,601.12 msec task-clock # 0.481 CPUs utilized
[info] 551 context-switches # 0.038 K/sec
[info] 79 cpu-migrations # 0.005 K/sec
[info] 142 page-faults # 0.010 K/sec
[info] 56,887,530,219 cycles # 3.896 GHz (
[info] 371,071,536 stalled-cycles-frontend # 0.65% frontend cycles idle (
[info] 39,685,104,564 stalled-cycles-backend # 69.76% backend cycles idle (
[info] 85,472,875,315 instructions # 1.50 insn per cycle
[info] # 0.46 stalled cycles per insn (
[info] 7,673,425,909 branches # 525.537 M/sec (
[info] 112,782,902 branch-misses # 1.47% of all branches (
[info] 56,302,951,054 L1-dcache-loads # 3856.070 M/sec (
[info] 7,136,733,184 L1-dcache-load-misses # 12.68% of all L1-dcache hits (
[info] <not supported> LLC-loads
[info] <not supported> LLC-load-misses
[info] 355,107,740 L1-icache-loads # 24.321 M/sec (
[info] 10,933,836 L1-icache-load-misses # 3.08% of all L1-icache hits (
[info] 113,201,590 dTLB-loads # 7.753 M/sec (
[info] 8,907,743 dTLB-load-misses # 7.87% of all dTLB cache hits (
[info] 46,228 iTLB-loads # 0.003 M/sec (
[info] 939,260 iTLB-load-misses # 2031.80% of all iTLB cache hits (
[info] 7,001,888,370 L1-dcache-prefetches # 479.545 M/sec (
NOT COVEREDNOT COVERED
Garbage collectors
Ahead of time compilation
JVM options
BENCHMARKSBENCHMARKS
All benchmarks above were performed unprofessionaly
All benchmarks lie
ON A GOOD BENCHMARKSON A GOOD BENCHMARKS
Benchmark might be ok:
you test it for yourself
you read about the tools tested
you can �nd what causes di�erences
HINTSHINTS
Tools will not magically solve your problems
Tools let you feel more comfortable
Do not expect PrintAssembly or Performance counters will help you in
typical problems
REAL PROBLEMS I HAD ON PRODUCTIONREAL PROBLEMS I HAD ON PRODUCTION
hidden read of whole database in a dirty piece of code
static synchronized
hashCode bugs
jvm bugs (metaspace)
Lot of them could be spotted by using jstack
THANK YOUTHANK YOU

Scala to assembly

  • 2.
    JAREKJAREK Started on C64(Basic, 6502), Amiga (68k, C), C++, Java, JS, Scala, Kotlin...
  • 3.
  • 4.
  • 6.
    IS THAT ALLTRUE?IS THAT ALL TRUE?
  • 7.
    long time ago tmp= $fb Start: ldx #10; !! decimal clc ldy #$0 lda #$1 fib: dex beq end sty tmp tay adc tmp jmp fib end: tay
  • 8.
    long time ago branchnot taken 2 cycles tmp = $fb Start: ldx #10; !! decimal clc ldy #$0 lda #$1 fib: dex beq end sty tmp tay adc tmp jmp fib end: tay
  • 9.
    long time ago branchnot taken 2 cycles branch taken 3(!!!) cycles tmp = $fb Start: ldx #10; !! decimal clc ldy #$0 lda #$1 fib: dex beq end sty tmp tay adc tmp jmp fib end: tay
  • 10.
    Rewrite if statementto save a cycle
  • 11.
    motorola 68040 branch taken�xed prediction strategy
  • 14.
    PERFORMANCE MYTHSPERFORMANCE MYTHS notprecise, valid only in certain contexts, valid in the past, not relevant
  • 15.
    COST OF PERFORMANCEMYTHSCOST OF PERFORMANCE MYTHS Code: less readable less �exible less abstract Lost focus.
  • 16.
    PERFORMANCE IS AGOOD MARKETING TOOLPERFORMANCE IS A GOOD MARKETING TOOL
  • 17.
    Marketing my shinynew library
  • 18.
    SCALASCALA Scala combines object-orientedand functional programming in one concise, high-level language.
  • 19.
    Scala is farfrom metal(s)
  • 20.
  • 21.
    SOME ARTICLESSOME ARTICLES applicative-functors-vs-monad-composing-performance-in-scala NoMore Transformers: High-Performance E�ects in Scalaz 8 ON PERFORMANCE OF THE NEW COLLECTIONS Scala’s immutable collections can be slow as a snail ...
  • 22.
    SCALA DOES NOTMAKE IT EASYSCALA DOES NOT MAKE IT EASY
  • 33.
    A given pieceof code may and will perform di�erently when called/used in a di�erent context
  • 34.
    WE WILL GODEEPWE WILL GO DEEP
  • 35.
    TOOLSTOOLS All the benchmarksand numbers in this talk are presented only to show the tools DO NOT make any decisions based on those numbers
  • 36.
    LET'S TALK ABOUTIMMUTABILITYLET'S TALK ABOUT IMMUTABILITY
  • 37.
    object ImmutableCalc { caseclass Calculator(total : Int = 0) { def add(i: Int) :Calculator = this.copy(total = total + i) def get() : Int = total } }
  • 38.
    object MutableCalc { classCalculator { var total = 0 def add(i: Int) : Unit = total+=i def get() : Int = total } }
  • 39.
  • 40.
  • 41.
    and ~ 20other methods //CODE decompiled using decompile Scala to Java public final class ImmutableCalc { public static class Calculator implements Product, Serializable { private final int total; public Iterator productElementNames() { return Product.productElementNames$(this); } public int total() { return this.total; } public ImmutableCalc.Calculator add(final int i) { return this.copy(this.total() + i); } public int get() { return this.total(); } public ImmutableCalc.Calculator copy(final int total) {
  • 42.
    If you donot like IDE...
  • 43.
    javap -p -c'ImmutableCalc$Calculator.class' public calculator.immuable.ImmutableCalc$Calculator add(int); Code: 0: aload_0 1: aload_0 2: invokevirtual #31 // Method total:()I 5: iload_1 6: iadd 7: invokevirtual #34 // Method copy:(I)Lcalculator/immuable/ImmutableCalc$Calcula 10: areturn
  • 44.
    public calculator.immuable.ImmutableCalc$Calculator add(int); Code: 0:aload_0 ; stack = [this] 1: aload_0 ; stack = [this,this] 2: invokevirtual #31 ; total() ; stack = [this,this.total] 5: iload_1 ; stack = [this, this.total, i] 6: iadd ; stack = [this, total + i] 7: invokevirtual #34 ; copy(...); stack = [newObject] 10: areturn ;return newObject https://en.wikipedia.org/wiki/Java_bytecode_instruction_listings
  • 45.
    INFOINFO There is alsolot of metadata in .class �les (not only bytecode) No one needs to read bytecode - unless You work on a compiler
  • 46.
    Run it object CalcRun{ def main(args: Array[String]): Unit = { var sum = 0L for ( i <-0 to 100000) { val calc = ImmutableCalc.Calculator() sum+=(1 to 10000).foldLeft( calc)((c,i) => c.add(i)).get() } println(sum) } }
  • 47.
  • 48.
  • 49.
    0,1,2,3,4 - compilationTier - higher ~ more optimized % - on stack replacement made not entrant - do not use this ! - exception handler s - synchronized 298 485 2 runners.CalcRun$::$anonfun$main$2 (6 bytes) 298 486 2 calculator.immuable.ImmutableCalc$Calculator::add (11 bytes) 298 487 2 calculator.immuable.ImmutableCalc$Calculator::copy (9 bytes) 298 472 1 java.lang.Integer::intValue (5 bytes) 298 473 1 scala.collection.immutable.RangeIterator::hasNext (5 bytes) 298 475 2 java.lang.Integer::<init> (10 bytes) 298 474 1 calculator.immuable.ImmutableCalc$Calculator::total (5 bytes) 298 476 2 java.lang.Integer::valueOf (32 bytes) ... 335 501 % 4 scala.collection.IterableOnceOps::foldLeft @ 13 (43 bytes) 341 499 % 3 scala.collection.IterableOnceOps::foldLeft @ -2 (43 bytes) made not entrant
  • 50.
    0,1,2,3,4 - compilationTier - higher ~ more optimized % - on stack replacement made not entrant - do not use this ! - exception handler s - synchronized 298 485 2 runners.CalcRun$::$anonfun$main$2 (6 bytes) 298 486 2 calculator.immuable.ImmutableCalc$Calculator::add (11 bytes) 298 487 2 calculator.immuable.ImmutableCalc$Calculator::copy (9 bytes) 298 472 1 java.lang.Integer::intValue (5 bytes) 298 473 1 scala.collection.immutable.RangeIterator::hasNext (5 bytes) 298 475 2 java.lang.Integer::<init> (10 bytes) 298 474 1 calculator.immuable.ImmutableCalc$Calculator::total (5 bytes) 298 476 2 java.lang.Integer::valueOf (32 bytes) ... 335 501 % 4 scala.collection.IterableOnceOps::foldLeft @ 13 (43 bytes) 341 499 % 3 scala.collection.IterableOnceOps::foldLeft @ -2 (43 bytes) made not entrant https://gist.github.com/rednaxelafx/1165804#�le_notes.md
  • 51.
    -XX:+PrintInlining 1275 579 4runners.CalcRun$$$Lambda$3/1967205423::apply$mcVI$sp (9 bytes) @ 5 runners.CalcRun$::$anonfun$main$1 (56 bytes) inline (hot) @ 7 calculator.immuable.ImmutableCalc$Calculator$::apply$default @ 10 calculator.immuable.ImmutableCalc$Calculator::<init> ( @ 6 java.lang.Object::<init> (1 bytes) inline (hot) @ 10 scala.Product::$init$ (1 bytes) inline (hot) @ 26 scala.LowPriorityImplicits::intWrapper (2 bytes) inline (hot) @ 32 scala.runtime.RichInt$::to$extension (15 bytes) inline (hot) @ 11 scala.collection.immutable.Range$Inclusive::<init> ( @ 4 scala.collection.immutable.Range::<init> (215 bytes) inline @ 16 scala.collection.immutable.AbstractSeq::<init> ( @ 1 scala.collection.AbstractSeq::<init> (5 bytes) inline (h @ 1 scala.collection.AbstractIterable::<init> (5 bytes) in @ 1 java.lang.Object::<init> (1 bytes) inline (hot) @ 78 scala.collection.immutable.Range::isEmpty (5 bytes) acces @ 89 scala.collection.immutable.Range::longLength (25 bytes) i @ 1 scala.collection.immutable.Range::gap (12 bytes) inline @ 1 scala.collection.immutable.Range::end (5 bytes) access @ 6 scala.collection.immutable.Range::start (5 bytes) acce @ 5 scala.collection.immutable.Range::step (5 bytes) accesso @ 11 scala.collection.immutable.Range::hasStub (18 bytes) in @ 1 scala.collection.immutable.Range$Inclusive::isInclusive
  • 52.
    WHY BOTHER WITHPRINTCOMPILATION,WHY BOTHER WITH PRINTCOMPILATION, INLININGINLINING 1. check if critical code was in fact JITed 2. sometimes problems with bytecode can be seen (method never compiled) 3. look for uncommon traps
  • 53.
    MEASURINGMEASURING JMH jvm warmup prevent deadcode elimination addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.3.7")
  • 54.
    VM version: JDK11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+9-jvmci-20.0- b02 @Benchmark def measureWrong: Unit = Math.log(x) @Benchmark def measureRight: Double = Math.log(x) [info] JMHSample_08_DeadCode.measureRight avgt 3 9.891 ± 13.465 ns/op [info] JMHSample_08_DeadCode.measureWrong avgt 3 0.383 ± 0.132 ns/op
  • 55.
    @Benchmark def measureWrong: Unit= Math.log(x) // in fact same as = {}
  • 56.
  • 57.
    @Benchmark def imperative: Int= { val calc = new MutableCalc.Calculator() (1 to numOps).foreach(calc.add) calc.get() } @Benchmark def immutable: Int = { val calc = ImmutableCalc.Calculator() (1 to numOps).foldLeft( calc)((c,i) => c.add(i)).get() }
  • 58.
    object ImmutableCalc { caseclass Calculator(total : Int = 0) { def add(i: Int) :Calculator = this.copy(total = total + i) def get() : Int = total } } object MutableCalc { class Calculator { var total = 0 def add(i: Int) : Unit = total+=i def get() : Int = total } }
  • 59.
    VM version: JDK1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b08
  • 60.
    VM version: JDK1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b08 [info] Benchmark Mode Cnt Score Error Units [info] CalcCost.freeMonadScalaZ avgt 5 154761.225 ± 10652.856 ns/op [info] CalcCost.immutable avgt 5 15393.204 ± 670.565 ns/op [info] CalcCost.imperative avgt 5 284.435 ± 11.790 ns/op [info] CalcCost.tfIO avgt 5 123841.734 ± 30096.835 ns/op
  • 61.
    VM version: JDK14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7
  • 62.
    VM version: JDK14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7 [info] CalcCost.freeMonadScalaZ avgt 5 130559.311 ± 21298.291 ns/op [info] CalcCost.immutable avgt 5 3682.014 ± 510.659 ns/op [info] CalcCost.imperative avgt 5 263.753 ± 8.953 ns/op [info] CalcCost.tfIO avgt 5 100277.673 ± 4185.455 ns/op
  • 63.
  • 64.
    20.0.0.r11-grl [info] CalcCost.freeMonadScalaZ avgt15 147832.984 ± 15674.721 ns/op [info] CalcCost.immutable avgt 15 601.389 ± 8.249 ns/op [info] CalcCost.imperative avgt 15 420.588 ± 103.107 ns/op [info] CalcCost.tfIO avgt 15 121616.405 ± 5716.283 ns/op
  • 65.
  • 66.
    JavaFlightRecorder - record"events" Java Mission Control - browse results
  • 67.
    The Java FlightRecorder (JFR) is a commercial feature. You can use it for free on developer desktops/laptops, and for evaluation purposes in test, development, and production environments. However, to enable JFR on a production server, you require a commercial license. Using JMC UI for other purposes on the JDK does not require a commercial license.
  • 68.
    Java Flight Recorder lowoverhead, detailed jvm events, additional graphs and tools (for jfr) (intellij) own events, (use for low overhead trace log) only on some jvms license - check your jvm version -XX:+FlightRecorder -XX:StartFlightRecording=duration=60s,filename=myrecording.j
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
    JPSJPS hint: jps -mlVv(detailed jvm args) 58661 Main 93654 Jps 57464 NailgunRunner 77659 org.eclipse.equinox.launcher_1.3.0.v20140415-2008.jar 80924 Launcher
  • 74.
    jstack 91221 (java8) "main" #1 prio=5 os_prio=0 tid=0x00007f33b400d800 nid=0x14812 runnable [0x00007f33bdf38000] java.lang.Thread.State: RUNNABLE at calculator.immuable.ImmutableCalc$Calculator.<init>(ImmutableCalc.scala:4) at calculator.immuable.ImmutableCalc$Calculator.copy(ImmutableCalc.scala:4) at calculator.immuable.ImmutableCalc$Calculator.add(ImmutableCalc.scala:6) at runners.CalcRun$.$anonfun$main$2(CalcRun.scala:10) at runners.CalcRun$.$anonfun$main$2$adapted(CalcRun.scala:10) at runners.CalcRun$$$Lambda$4/801197928.apply(Unknown Source) at scala.collection.IterableOnceOps.foldLeft(IterableOnce.scala:636) at scala.collection.IterableOnceOps.foldLeft$(IterableOnce.scala:632) at scala.collection.AbstractIterable.foldLeft(Iterable.scala:921) at runners.CalcRun$.$anonfun$main$1(CalcRun.scala:10)
  • 75.
    jstack 94038 (graal) "main"#1 prio=5 os_prio=0 cpu=26791.23ms elapsed=26.89s tid=0x00007fafb0028800 nid=0x13d86 runnable [ java.lang.Thread.State: RUNNABLE at scala.collection.IterableOnceOps.foldLeft(IterableOnce.scala:635) at scala.collection.IterableOnceOps.foldLeft$(IterableOnce.scala:632) at scala.collection.AbstractIterable.foldLeft(Iterable.scala:921) at runners.CalcRun$.$anonfun$main$1(CalcRun.scala:10) at runners.CalcRun$$$Lambda$17/0x00000008400d8040.apply$mcVI$sp(Unknown Source) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:190) at runners.CalcRun$.main(CalcRun.scala:8) at runners.CalcRun.main(CalcRun.scala) "Reference Handler" #2 daemon prio=10 os_prio=0 cpu=0.32ms elapsed=26.87s tid=0x00007fafb024c800 nid= java.lang.Thread.State: RUNNABLE at java.lang.ref.Reference.waitForReferencePendingList(java.base@11.0.6/Native Method) at java.lang.ref.Reference.processPendingReferences(java.base@11.0.6/Reference.java:241 at java.lang.ref.Reference$ReferenceHandler.run(java.base@11.0.6/Reference.java:213)
  • 76.
  • 77.
  • 78.
  • 79.
    [Constants] # {method} {0x00007f93f0e2fc48}'add' '(I)Lcalculator/immuable/ImmutableCalc$Calculator;' in # this: rsi:rsi = 'calculator/immuable/ImmutableCalc$Calculator' # parm0: rdx = int # [sp+0x80] (sp of caller) 0x00007f93f1422680: mov 0x8(%rsi),%r10d 0x00007f93f1422684: shl $0x3,%r10 0x00007f93f1422688: cmp %rax,%r10 0x00007f93f142268b: jne 0x00007f93f1045ba0 ; {runtime_call} 0x00007f93f1422691: nopl 0x0(%rax,%rax,1) 0x00007f93f1422699: nopl 0x0(%rax) [Verified Entry Point] 0x00007f93f14226a0: mov %eax,-0x14000(%rsp) 0x00007f93f14226a7: push %rbp 0x00007f93f14226a8: sub $0x70,%rsp 0x00007f93f14226ac: movabs $0x7f93f0e43db0,%rdi 0x00007f93f14226b6: mov 0x8(%rdi),%eax 0x00007f93f14226b9: add $0x8,%eax 0x00007f93f14226bc: mov %eax,0x8(%rdi) 0x00007f93f14226bf: movabs $0x7f93f0e2fc48,%rdi ; {metadata({method} {0x00007f93f0e2fc48} 'add' '(I 0x00007f93f14226c9: and $0x3ff8,%eax 0x00007f93f14226cf: cmp $0x0,%eax
  • 80.
    -XX:PrintAssemblyOptions=intel 0x00007f11f540f880: mov r10d,DWORDPTR [rsi+0x8] 0x00007f11f540f884: shl r10,0x3 0x00007f11f540f888: cmp r10,rax 0x00007f11f540f88b: jne 0x00007f11f5045ba0 ; {runtime_call} 0x00007f11f540f891: nop DWORD PTR [rax+rax*1+0x0] 0x00007f11f540f899: nop DWORD PTR [rax+0x0] [Verified Entry Point] 0x00007f11f540f8a0: mov DWORD PTR [rsp-0x14000],eax 0x00007f11f540f8a7: push rbp 0x00007f11f540f8a8: sub rsp,0x70 0x00007f11f540f8ac: movabs rdi,0x7f11e3e43db0 0x00007f11f540f8b6: mov eax,DWORD PTR [rdi+0x8] 0x00007f11f540f8b9: add eax,0x8 0x00007f11f540f8bc: mov DWORD PTR [rdi+0x8],eax 0x00007f11f540f8bf: movabs rdi,0x7f11e3e2fc48 ; {metadata({method} {0x00007f11e3e2fc48} 'add' '(I)L 0x00007f11f540f8c9: and eax,0x3ff8 0x00007f11f540f8cf: cmp eax,0x0 0x00007f11f540f8d2: je 0x00007f11f540f92c ;*aload_0
  • 81.
  • 85.
  • 86.
    jstat java 8 jstat-gccause 128409 1000 1000 S0 S1 E O M CCS YGC YGCT FGC FGCT CGC CGCT GCT LGCC 0.00 6.25 0.00 1.10 93.80 80.76 59 0.137 0 0.000 - - 0.137 0.00 0.00 56.00 1.10 93.80 80.76 63 0.146 0 0.000 - - 0.146 0.00 6.25 6.00 1.10 93.80 80.76 67 0.155 0 0.000 - - 0.155 6.25 0.00 76.00 1.10 93.80 80.76 70 0.162 0 0.000 - - 0.162 0.00 0.00 78.00 1.10 93.80 80.76 74 0.172 0 0.000 - - 0.172 6.25 0.00 70.00 1.10 93.80 80.76 78 0.181 0 0.000 - - 0.181 6.25 0.00 44.00 1.10 93.80 80.76 82 0.193 0 0.000 - - 0.193 0.00 0.00 22.00 1.10 93.80 80.76 86 0.199 0 0.000 - - 0.199 6.25 0.00 20.00 1.10 93.80 80.76 90 0.205 0 0.000 - - 0.205 0.00 0.00 44.00 1.10 93.80 80.76 95 0.213 0 0.000 - - 0.213 0.00 6.25 58.00 1.10 93.80 80.76 101 0.226 0 0.000 - - 0.226 0.00 0.00 76.00 1.10 93.80 80.76 104 0.239 0 0.000 - - 0.239 0.00 6.25 42.00 1.10 93.80 80.76 107 0.249 0 0.000 - - 0.249 0.00 0.00 30.00 1.10 93.80 80.76 110 0.255 0 0.000 - - 0.255
  • 87.
    jstat graal jstat -gccause129920 1000 1000 S0 S1 E O M CCS YGC YGCT FGC FGCT CGC CGCT GCT LGCC 0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003 0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003 0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003 0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003 0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003 0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003 0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003 0.00 100.00 2.58 0.00 57.25 54.24 1 0.003 0 0.000 0 0.000 0.003
  • 88.
    THERE ARE MORETOOLSTHERE ARE MORE TOOLS JPro�ler DynaTrace
  • 89.
    IF WE HAVESOME TIMEIF WE HAVE SOME TIME
  • 90.
    LETS TALS ABOUTMONAD TRANSFORMERSLETS TALS ABOUT MONAD TRANSFORMERS
  • 91.
    case class User(id:Long, name: String, parentId: Long) object UserData { val allUsers = List( User(1, "Nikt", 8), User(2, "Ktos", 7), User(4, "A", 2), User(7, "B", 4) ) }
  • 92.
    (scalaz used) object Users{ val allUsers = UserData.allUsers def findUser(id: Long): OptionT[Task, User] = OptionT.optionT { Task.delay({ val found: Option[User] = allUsers.find(u => u.id == id) found } ) } } val indices: Seq[Int] = (1 to 100).toList def run = { val users: Seq[OptionT[Task, User]] = indices.map(id => Users.findUser(id)) val parents = users.map(u => u.flatMap(z => Users.findUser(z.parentId))) parents.map(u => u.run.unsafePerformSync).count(u => u.isDefined) }
  • 93.
    object StackedMonads { valindices: Seq[Int] = (1 to 100).toList def run = { val users = indices.map ( id => Users.findUser(id)) val parents = users.map ((u: Task[Option[User]]) => u.flatMap((ou: Option[User]) => ou match { case None => Task.delay({None}) case Some(x) => Users.findUser(x.parentId) } ) ) parents.map (u=>u.unsafePerformSync).count(u =>u.isDefined) } def findUser(id: Long): Task[Option[User]] = { Task.delay { val found: Option[User] = allUsers.find(u => u.id == id) found } }
  • 94.
    Monad transformer vssimply stacked (jvm 8)
  • 95.
    Let's simply usenull object NOMonad { val indices: Seq[Int] = (1 to 100).toList def run = { val users = indices.map(id => Users.findUser(id)) val parents: Seq[User] = users.map(u => if (u != null) { Users.findUser(u.parentId) } else { null } ) parents.count((u: User) =>u != null) } object Users { val allUsers = UserData.allUsers def findUser(id: Long): User = { val found= allUsers.find(u => u.id == id).orNull found } }
  • 96.
    wow, nulls areawesome! (jvm 8)
  • 97.
    object SimpleDB { valconnection = DB.initializeDB def run = { val users = indices.map(id => Users.findUser(id)) val parents: Seq[User] = users.map(u => if (u != null) { Users.findUser(u.parentId) } else { null } ) parents.count((u: User) => u != null) } object Users { val userQuery = connection.prepareStatement("SELECT * FROM USERS WHERE id= ?") def findUser(id: Long): User = { userQuery.clearParameters() userQuery.setLong(1, id) val res = userQuery.executeQuery()
  • 98.
    Time to DROPDATABASE
  • 99.
  • 100.
    object OptimizedDB { valconnection = DB.initializeDB def run = { val count = Users.countParents() count } object Users { val userQuery = connection.prepareStatement("SELECT " + "count(*) FROM USERS u1, USERS u2 WHERE u1.id>1 and u1.id < 100 AND u1.parentId = u2.id" def countParents(): Int = { userQuery.clearParameters() val res = userQuery.executeQuery() try { if (res.next()) { res.getInt(1) } else { throw new IllegalStateException("no count from DB") }
  • 101.
  • 102.
  • 103.
    jvm 8 [info] MT.monadTRavgt 3 16210.589 ± 14114.784 ns/op [info] MT.monadTRDB avgt 3 112424.770 ± 109957.046 ns/op [info] MT.noMonad avgt 3 1910.297 ± 64.699 ns/op [info] MT.optimizedDB avgt 3 685.030 ± 99.553 ns/op [info] MT.simpleDB avgt 3 103198.262 ± 36190.463 ns/op [info] MT.stackedMonads avgt 3 13040.031 ± 733.337 ns/op
  • 104.
    jvm 14 [info] MT.monadTRavgt 3 13328.252 ± 9569.020 ns/op [info] MT.monadTRDB avgt 3 123824.784 ± 148322.320 ns/op [info] MT.noMonad avgt 3 1934.838 ± 1596.319 ns/op [info] MT.optimizedDB avgt 3 767.592 ± 49.940 ns/op [info] MT.simpleDB avgt 3 96992.954 ± 828.014 ns/op [info] MT.stackedMonads avgt 3 15777.281 ± 7370.754 ns/op
  • 105.
    graal 20 [info] MT.monadTRavgt 3 11057.304 ± 4885.176 ns/op [info] MT.monadTRDB avgt 3 119223.204 ± 109133.489 ns/op [info] MT.noMonad avgt 3 1529.025 ± 71.307 ns/op [info] MT.optimizedDB avgt 3 905.659 ± 34.349 ns/op [info] MT.simpleDB avgt 3 107627.684 ± 6618.677 ns/op [info] MT.stackedMonads avgt 3 8534.377 ± 739.795 ns/op
  • 106.
    DEEPER THAN ASSEMBLYDEEPERTHAN ASSEMBLY Just looping
  • 107.
    simple for val array= Array.ofDim[Long](sz + 1, sz + 1) @Benchmark def forBasedLoop: Long = { var sum: Long = 0 for (i <- 0 to sz) { for (j <- 0 to sz) { sum += array(i)(j) } } sum }
  • 108.
    while @Benchmark def whileLoopA: Long= { var sum: Long = 0 var i = 0 while (i < sz) { var j = 0 while (j < sz) { sum += array(i)(j) j = j + 1 } i = i + 1 } sum }
  • 109.
    while reversed (andtransposed) @Benchmark def whileRevLoop: Long = { var sum: Long = 0 var j = sz while (j >= 0) { var i = sz while (i >= 0) { sum += array(i)(j) i = i - 1 } j = j - 1 } sum }
  • 110.
    PERFORMANCE COUNTERSPERFORMANCE COUNTERS jmh:run-prof perf -w 5s -r 5s -wi 3 -i 3 -f 3 .*ArrayLooping.*
  • 113.
    [info] 14,650.91 msectask-clock # 0.482 CPUs utilized [info] 525 context-switches # 0.036 K/sec [info] 65 cpu-migrations # 0.004 K/sec [info] 185 page-faults # 0.013 K/sec [info] 30,370,384,927 cycles # 2.073 GHz ( [info] 285,624,461 stalled-cycles-frontend # 0.94% frontend cycles idle ( [info] 19,572,932,807 stalled-cycles-backend # 64.45% backend cycles idle ( [info] 35,907,154,673 instructions # 1.18 insn per cycle [info] # 0.55 stalled cycles per insn ( [info] 6,706,376,215 branches # 457.745 M/sec ( [info] 12,810,882 branch-misses # 0.19% of all branches ( [info] 16,163,472,375 L1-dcache-loads # 1103.240 M/sec ( [info] 10,975,494,751 L1-dcache-load-misses # 67.90% of all L1-dcache hits ( [info] <not supported> LLC-loads [info] <not supported> LLC-load-misses [info] 175,788,792 L1-icache-loads # 11.998 M/sec ( [info] 7,051,118 L1-icache-load-misses # 4.01% of all L1-icache hits ( [info] 5,302,762,430 dTLB-loads # 361.941 M/sec ( [info] 14,779,238 dTLB-load-misses # 0.28% of all dTLB cache hits ( [info] 245,713 iTLB-loads # 0.017 M/sec ( [info] 1,195,242 iTLB-load-misses # 486.44% of all iTLB cache hits ( [info] 289,031,594 L1-dcache-prefetches # 19.728 M/sec (
  • 114.
    while T+R [info] 14,650.91msec task-clock # 0.482 CPUs utilized [info] 525 context-switches # 0.036 K/sec [info] 65 cpu-migrations # 0.004 K/sec [info] 185 page-faults # 0.013 K/sec [info] 30,370,384,927 cycles # 2.073 GHz ( [info] 285,624,461 stalled-cycles-frontend # 0.94% frontend cycles idle ( [info] 19,572,932,807 stalled-cycles-backend # 64.45% backend cycles idle ( [info] 35,907,154,673 instructions # 1.18 insn per cycle [info] # 0.55 stalled cycles per insn ( [info] 6,706,376,215 branches # 457.745 M/sec ( [info] 12,810,882 branch-misses # 0.19% of all branches ( [info] 16,163,472,375 L1-dcache-loads # 1103.240 M/sec ( [info] 10,975,494,751 L1-dcache-load-misses # 67.90% of all L1-dcache hits ( [info] <not supported> LLC-loads [info] <not supported> LLC-load-misses [info] 175,788,792 L1-icache-loads # 11.998 M/sec ( [info] 7,051,118 L1-icache-load-misses # 4.01% of all L1-icache hits ( [info] 5,302,762,430 dTLB-loads # 361.941 M/sec ( [info] 14,779,238 dTLB-load-misses # 0.28% of all dTLB cache hits ( [info] 245,713 iTLB-loads # 0.017 M/sec ( [info] 1,195,242 iTLB-load-misses # 486.44% of all iTLB cache hits ( [info] 289,031,594 L1-dcache-prefetches # 19.728 M/sec (
  • 115.
    while T+R L1 [info] 14,650.91msec task-clock # 0.482 CPUs utilized [info] 525 context-switches # 0.036 K/sec [info] 65 cpu-migrations # 0.004 K/sec [info] 185 page-faults # 0.013 K/sec [info] 30,370,384,927 cycles # 2.073 GHz ( [info] 285,624,461 stalled-cycles-frontend # 0.94% frontend cycles idle ( [info] 19,572,932,807 stalled-cycles-backend # 64.45% backend cycles idle ( [info] 35,907,154,673 instructions # 1.18 insn per cycle [info] # 0.55 stalled cycles per insn ( [info] 6,706,376,215 branches # 457.745 M/sec ( [info] 12,810,882 branch-misses # 0.19% of all branches ( [info] 16,163,472,375 L1-dcache-loads # 1103.240 M/sec ( [info] 10,975,494,751 L1-dcache-load-misses # 67.90% of all L1-dcache hits ( [info] <not supported> LLC-loads [info] <not supported> LLC-load-misses [info] 175,788,792 L1-icache-loads # 11.998 M/sec ( [info] 7,051,118 L1-icache-load-misses # 4.01% of all L1-icache hits ( [info] 5,302,762,430 dTLB-loads # 361.941 M/sec ( [info] 14,779,238 dTLB-load-misses # 0.28% of all dTLB cache hits ( [info] 245,713 iTLB-loads # 0.017 M/sec ( [info] 1,195,242 iTLB-load-misses # 486.44% of all iTLB cache hits ( [info] 289,031,594 L1-dcache-prefetches # 19.728 M/sec (
  • 116.
    [info] 14,601.12 msectask-clock # 0.481 CPUs utilized [info] 551 context-switches # 0.038 K/sec [info] 79 cpu-migrations # 0.005 K/sec [info] 142 page-faults # 0.010 K/sec [info] 56,887,530,219 cycles # 3.896 GHz ( [info] 371,071,536 stalled-cycles-frontend # 0.65% frontend cycles idle ( [info] 39,685,104,564 stalled-cycles-backend # 69.76% backend cycles idle ( [info] 85,472,875,315 instructions # 1.50 insn per cycle [info] # 0.46 stalled cycles per insn ( [info] 7,673,425,909 branches # 525.537 M/sec ( [info] 112,782,902 branch-misses # 1.47% of all branches ( [info] 56,302,951,054 L1-dcache-loads # 3856.070 M/sec ( [info] 7,136,733,184 L1-dcache-load-misses # 12.68% of all L1-dcache hits ( [info] <not supported> LLC-loads [info] <not supported> LLC-load-misses [info] 355,107,740 L1-icache-loads # 24.321 M/sec ( [info] 10,933,836 L1-icache-load-misses # 3.08% of all L1-icache hits ( [info] 113,201,590 dTLB-loads # 7.753 M/sec ( [info] 8,907,743 dTLB-load-misses # 7.87% of all dTLB cache hits ( [info] 46,228 iTLB-loads # 0.003 M/sec ( [info] 939,260 iTLB-load-misses # 2031.80% of all iTLB cache hits ( [info] 7,001,888,370 L1-dcache-prefetches # 479.545 M/sec (
  • 117.
    normal while [info] 14,601.12msec task-clock # 0.481 CPUs utilized [info] 551 context-switches # 0.038 K/sec [info] 79 cpu-migrations # 0.005 K/sec [info] 142 page-faults # 0.010 K/sec [info] 56,887,530,219 cycles # 3.896 GHz ( [info] 371,071,536 stalled-cycles-frontend # 0.65% frontend cycles idle ( [info] 39,685,104,564 stalled-cycles-backend # 69.76% backend cycles idle ( [info] 85,472,875,315 instructions # 1.50 insn per cycle [info] # 0.46 stalled cycles per insn ( [info] 7,673,425,909 branches # 525.537 M/sec ( [info] 112,782,902 branch-misses # 1.47% of all branches ( [info] 56,302,951,054 L1-dcache-loads # 3856.070 M/sec ( [info] 7,136,733,184 L1-dcache-load-misses # 12.68% of all L1-dcache hits ( [info] <not supported> LLC-loads [info] <not supported> LLC-load-misses [info] 355,107,740 L1-icache-loads # 24.321 M/sec ( [info] 10,933,836 L1-icache-load-misses # 3.08% of all L1-icache hits ( [info] 113,201,590 dTLB-loads # 7.753 M/sec ( [info] 8,907,743 dTLB-load-misses # 7.87% of all dTLB cache hits ( [info] 46,228 iTLB-loads # 0.003 M/sec ( [info] 939,260 iTLB-load-misses # 2031.80% of all iTLB cache hits ( [info] 7,001,888,370 L1-dcache-prefetches # 479.545 M/sec (
  • 118.
    normal while L1 [info] 14,601.12msec task-clock # 0.481 CPUs utilized [info] 551 context-switches # 0.038 K/sec [info] 79 cpu-migrations # 0.005 K/sec [info] 142 page-faults # 0.010 K/sec [info] 56,887,530,219 cycles # 3.896 GHz ( [info] 371,071,536 stalled-cycles-frontend # 0.65% frontend cycles idle ( [info] 39,685,104,564 stalled-cycles-backend # 69.76% backend cycles idle ( [info] 85,472,875,315 instructions # 1.50 insn per cycle [info] # 0.46 stalled cycles per insn ( [info] 7,673,425,909 branches # 525.537 M/sec ( [info] 112,782,902 branch-misses # 1.47% of all branches ( [info] 56,302,951,054 L1-dcache-loads # 3856.070 M/sec ( [info] 7,136,733,184 L1-dcache-load-misses # 12.68% of all L1-dcache hits ( [info] <not supported> LLC-loads [info] <not supported> LLC-load-misses [info] 355,107,740 L1-icache-loads # 24.321 M/sec ( [info] 10,933,836 L1-icache-load-misses # 3.08% of all L1-icache hits ( [info] 113,201,590 dTLB-loads # 7.753 M/sec ( [info] 8,907,743 dTLB-load-misses # 7.87% of all dTLB cache hits ( [info] 46,228 iTLB-loads # 0.003 M/sec ( [info] 939,260 iTLB-load-misses # 2031.80% of all iTLB cache hits ( [info] 7,001,888,370 L1-dcache-prefetches # 479.545 M/sec (
  • 119.
    NOT COVEREDNOT COVERED Garbagecollectors Ahead of time compilation JVM options
  • 120.
    BENCHMARKSBENCHMARKS All benchmarks abovewere performed unprofessionaly All benchmarks lie
  • 121.
    ON A GOODBENCHMARKSON A GOOD BENCHMARKS Benchmark might be ok: you test it for yourself you read about the tools tested you can �nd what causes di�erences
  • 122.
    HINTSHINTS Tools will notmagically solve your problems Tools let you feel more comfortable Do not expect PrintAssembly or Performance counters will help you in typical problems
  • 123.
    REAL PROBLEMS IHAD ON PRODUCTIONREAL PROBLEMS I HAD ON PRODUCTION hidden read of whole database in a dirty piece of code static synchronized hashCode bugs jvm bugs (metaspace) Lot of them could be spotted by using jstack
  • 124.