NUMA & Java Databases
Should we worry?
Raghavendra Prabhu
me@rdprabhu.com
@randomsurfer
NUMA Reference architecture
What is NUMA
● Stands for Non-Uniform Memory Access
○ Non-uniform to whom?
○ Von Neumann bottleneck.
○ Cache coherent NUMA
● How does it work
○ Memory is placed local to the processes.
○ Balancing access to data over the available processors on multiple nodes.
● Large memory installations are becoming the norm
○ The i2 series on AWS.
○ Databases are the main consumers.
● Constraints
○ Speed of light
○ Interconnect saturation
What is NUMA
● Constraints
○ Speed of light
■ Higher latency of accessing remote memory.
○ Interconnect saturation
■ Performance counters.
● Slow abundant memory
○ Fast limited memory
● Cache coherence
○ Processor threads and cores share resources
■ Execution units (between HT threads)
■ Cache (between threads and cores)
Exotic cases
● Network cards
● PCIe storage
● NVRAM
● Nodes without memory
● Nodes without processors
● Unbalanced
● Central/Large memory
● Big Little architecture
● GPU
NUMA statistics
Tools/libraries for NUMA
● Supported by Linux since 2.5
○ Symmetric and CPU/Memory
● Numactl
● Hwloc / lstopo
● Numad
● Numatop
● Libnuma
● Numastat
● Taskset
● KVM for simulation and testing
● Perf
Tools/libraries for NUMA
● KVM for simulation and testing
● Useful for testing databases.
qemu-system-x86_64 -enable-kvm \
  -drive file=./debian-8.1-lxc-puppet.qcow2 \
  -net nic,macaddr=52:54:00:00:EE:03 -net vde \
  -smp sockets=2,cores=2,threads=2,maxcpus=16 \
  -numa node,nodeid=0,cpus=0-3 \
  -numa node,nodeid=1,cpus=4-7 \
  -numa node,nodeid=2,cpus=8-15 \
  -m 2G
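Once a guest with a simulated topology boots, it helps to confirm that the kernel really sees the nodes. The following is a minimal sketch, assuming a Linux guest that exposes the usual sysfs layout under /sys/devices/system/node; the function name `list_numa_nodes` is made up for this example.

```sh
# Print each NUMA node the kernel knows about and the CPUs attached to it.
# The optional argument overrides the sysfs base directory (handy for testing).
list_numa_nodes() {
    base=${1:-/sys/devices/system/node}
    for d in "$base"/node[0-9]*; do
        # Skip when the glob matched nothing or the node has no cpulist file.
        [ -r "$d/cpulist" ] || continue
        echo "${d##*/}: cpus $(cat "$d/cpulist")"
    done
}

list_numa_nodes
```

`numactl --hardware` and `lstopo` give a richer view of the same information when those tools are installed in the guest.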
NUMA Policies
● MPOL_DEFAULT
● MPOL_BIND
● MPOL_INTERLEAVE
○ Memory striping in hardware
● MPOL_PREFERRED
● MPOL_MF_MOVE | MPOL_MF_MOVE_ALL
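From the command line, these policies are usually requested through numactl rather than by calling set_mempolicy(2) or mbind(2) directly. The sketch below maps the policy names to the numactl switches that come closest; the helper name `numactl_flag_for_policy` and the `./my-server` binary are hypothetical, and treating `--localalloc` as the equivalent of MPOL_DEFAULT is an approximation.

```sh
# Map a Linux memory policy name to the numactl switch that requests it.
# The second argument is a node list such as "0", "0,1" or "all".
# Note that --preferred expects a single node, not a list.
numactl_flag_for_policy() {
    policy=$1
    nodes=${2:-all}
    case "$policy" in
        MPOL_DEFAULT)    echo "--localalloc" ;;
        MPOL_BIND)       echo "--membind=$nodes" ;;
        MPOL_INTERLEAVE) echo "--interleave=$nodes" ;;
        MPOL_PREFERRED)  echo "--preferred=$nodes" ;;
        *) echo "unknown policy: $policy" >&2; return 1 ;;
    esac
}

# Print (but do not run) the command line for an interleaved process.
echo "numactl $(numactl_flag_for_policy MPOL_INTERLEAVE all) ./my-server"
```

The MPOL_MF_MOVE flags have no entry here because they modify an mbind(2) call on existing pages rather than selecting a policy for a new process.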
JVM GC spaces
● Concepts
○ Weak Generational Hypothesis:
■ Most objects soon become unreachable.
■ The ones that do not usually survive for a (very) long time.
■ References from old objects to young objects only exist in small numbers.
○ Garbage Collection Roots
○ Mark &
■ Copy
■ Compact
■ Sweep
○ Minor and Major GC
○ Stop-the-World
GC graphs
JVM GC spaces
● Generations:
○ Young Generation
■ Eden space
● Mutable Space.
● Thread Local Allocation Buffer.
● Mark and Copy.
■ Survivor spaces (S0 and S1).
○ Old/Tenured Generation
○ Permanent Generation
■ Replaced by the native Metaspace in Java 8.
● Cross-generation links.
● Card-marking
Garbage collectors
Located in hotspot/src/share/vm/gc_implementation
● Serial
● Parallel
○ Only GC which is fully NUMA aware.
● ParNew
● Concurrent Mark and Sweep (CMS)
● Garbage First (G1)
● Official Oracle documentation is notoriously bad!
○ Code and comments are the (only) documentation (sadly).
■ Try searching for ‘NUMAPageScanRate’: you will find a page from 2008 with links to sun.com and Solaris examples.
GC Options
NUMA options
● UseNUMA
● UseNUMAInterleaving
● ForceNUMA
● NUMAStats
● ParallelGC only
○ NUMAChunkResizeWeight
○ NUMASpaceResizeRate
○ UseAdaptiveNUMAChunkSizing
○ NUMAPageScanRate
Defined in hotspot/src/share/vm/runtime/globals.hpp and used in hotspot/src/os/linux/vm
NUMA and Collectors
● -XX:+UseNUMA -XX:+UseNUMAInterleaving: All GC spaces.
○ Independent of GC choices.
○ NUMA interleaved allocation. (numactl --interleave)
● ParallelGC (in addition to above)
○ Supports all exotic NUMA options.
○ Eden mutableSpace (even without NUMA)
■ Pretouching the pages.
○ Eden mutableNUMASpace (with above NUMA options)
■ Space split into per-locality-group (LG) chunks.
● Adaptive Resizing.
■ Does thread-local NUMA allocation.
● Allocations are performed in the chunk corresponding to the thread's home locality group.
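The two flag combinations described above can be written down as a small helper. This is only a sketch: the function name `jvm_numa_opts` and its two mode names are invented for illustration, while the `-XX` flags themselves are the ones listed in the slides.

```sh
# Build JVM options for the two NUMA setups discussed above.
#   interleave: collector-independent interleaving of the GC spaces
#   parallel:   ParallelGC with its NUMA-aware Eden space
jvm_numa_opts() {
    case "$1" in
        interleave) echo "-XX:+UseNUMA -XX:+UseNUMAInterleaving" ;;
        parallel)   echo "-XX:+UseParallelGC -XX:+UseNUMA" ;;
        *) echo "usage: jvm_numa_opts interleave|parallel" >&2; return 1 ;;
    esac
}

# Print (but do not run) a java invocation using the ParallelGC variant.
echo "java $(jvm_numa_opts parallel) -version"
```

To see which NUMA flags a given JVM build actually knows and what values they ended up with, `java -XX:+UseNUMA -XX:+PrintFlagsFinal -version | grep NUMA` is a quick check.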
Cassandra
● JVM options are supported through an environment variable.
● Cassandra’s ‘supported’ NUMA is through numactl in shell wrapper.
○ This interleaves ‘everything’.
○ When you have numactl (hammer), everything looks like a (binary?) nail.
● Cassandra memory model
○ JVM GC spaces.
○ OHC - off heap cache: https://github.com/snazy/ohc
■ Written specifically for Cassandra 2.x
○ MemoryUtil.java
■ com.sun.jna.Native - Native.malloc
■ sun.nio.ch.DirectBuffer
■ sun.misc.Unsafe - unsafe.allocateMemory
■ java.nio.ByteBuffer - ByteBuffer.allocateDirect
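The "numactl in a shell wrapper" approach boils down to probing for a working numactl and prefixing the java command with it. The following is a rough sketch of that pattern, not a copy of Cassandra's launcher; the function name `numactl_prefix` is hypothetical, and the `NUMACTL_ARGS` override is an assumption about how such a wrapper might be parameterised.

```sh
# Print the numactl prefix to use, or nothing if numactl is missing or
# cannot apply the policy (for example inside a restricted container).
numactl_prefix() {
    args=${NUMACTL_ARGS:-"--interleave=all"}
    # $args is left unquoted on purpose so several switches can be passed.
    if command -v numactl >/dev/null 2>&1 &&
       numactl $args true >/dev/null 2>&1; then
        echo "numactl $args"
    fi
}

# With an empty prefix this degrades to a plain java invocation.
echo "would run: $(numactl_prefix) java \$JVM_OPTS ..."
```

Because the prefix wraps the whole process, the heap, thread stacks and every native allocation inherit the interleave policy, which is the "interleaves everything" behaviour noted above.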
Cassandra off-heap
● Why off-heap
○ Reduce GC pressure
○ Access patterns
○ Lack of support for primitives such as O_DIRECT. (https://bugs.openjdk.java.net/browse/JDK-8164900)
○ Lack of NUMA support in newer GCs.
■ ( JEP 157: G1 GC: NUMA-Aware Allocation http://openjdk.java.net/jeps/157)
● Off-heap caches are used for:
○ Row cache
○ Key cache
○ Counter cache
● Available from 2.x onwards; works noticeably better from 2.2.
Cassandra off-heap
● Cache Providers:
○ SerializingCache
■ Issues with serialization and CPU usage.
○ OHCP - org.caffinitas.ohc.OHCacheBuilder - 2.2 onwards
■ “OHC shall provide a good performance on both commodity hardware and big systems using non-uniform-memory-architectures.”
■ sun.misc.Unsafe: unsafe.allocateMemory
■ Linked: For Larger entries
● Malloc and fragmentation
■ Chunked: For smaller entries
NUMA issues
● Numactl --interleave:
○ Thread-local native allocations - Bad [X]
■ Tons of them throughout code which bypass JVM.
○ JVM’s Eden space will also be interleaved - Bad [X]
● JVM’s options only:
○ Native allocations will be local.
○ Large off-heap allocations can suffer.
● Numactl + JVM
■ NUMA-aware GC (Parallel)
● Best possible combination (without invasive code changes in Cassandra).
● JVM’s memory options will override numactl.
● But, ParallelGC is not comparable to new ones (G1).
Interpretation
● Low off-heap usage
○ Use the JVM NUMA options. Don’t interleave with numactl; it is a hammer.
● High off-heap usage (like Cassandra)
○ Just go with the flow, and use numactl.
■ -XX:+AlwaysPreTouch? (MAP_POPULATE)
○ Cost-benefit analysis.
● ParallelGC is too old (and bad for latency) - don’t use it just for NUMA.
○ Well-implemented NUMA can easily pique anyone’s geeky senses. :)
○ Ask Cassandra or Oracle to add NUMA support to G1 ;)
● On newer kernels (e.g. the 4.4 kernel in Ubuntu Xenial), one can try AutoNUMA.
○ Completely managed by kernel based on access patterns.
○ Has caveats but one can always benchmark and see. :)
Interpretation
● The JVM is (still) not good with native primitives such as O_DIRECT or NUMA (there is a jnuma, which is not that well maintained).
○ Many database authors write their own off-JVM implementations for these. (There are so many Java databases these days.)
○ Some also do things like this.
○ MySQL (InnoDB) can (and does) take advantage of these for good performance.
■ InnoDB was in Cassandra’s place about two years ago, till fixes landed.
● How InnoDB does it.
○ Maybe ScyllaDB in the future. ;)
Wishlist for Cassandra
● Use whatever GC fits best. (G1?)
■ Ask for NUMA support in this.
● Use the JVM NUMA options when supported.
■ Having NUMA support for Eden spaces will help a lot.
● Don’t use numactl.
○ Let all native allocations be local (OS default).
○ Use jnuma (or equivalent, it is just a JNI wrapper) for OHCP and other large non-local caches.
■ Use NUMA interleaving here.
■ This requires Cassandra or OHCP code to be changed.
● Changing OHCP code is easier.
● Benchmark
○ ??
○ Profit!
AutoNUMA
● In mainline since the 3.x kernel series (foundation in 3.8), so available in 4.x kernels
● CPU follows memory
○ Reschedule tasks on same nodes as memory
● Memory follows CPU
○ Copy memory pages to same nodes as tasks/threads
● Heuristics
○ Fault statistics
○ Task grouping
○ Multi-resource optimization - cache, cpu, memory, starvation
■ Avoid thrashing
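Whether automatic NUMA balancing is active on a given host can be read from procfs. Here is a minimal sketch, assuming a kernel that exposes /proc/sys/kernel/numa_balancing; the function name `numa_balancing_status` is made up for this example.

```sh
# Report whether automatic NUMA balancing is on.
# The optional argument overrides the procfs path (handy for testing).
numa_balancing_status() {
    f=${1:-/proc/sys/kernel/numa_balancing}
    if [ ! -r "$f" ]; then
        # Kernel built without NUMA balancing, or not a Linux host.
        echo "unsupported"
        return 0
    fi
    case "$(cat "$f")" in
        0) echo "disabled" ;;
        *) echo "enabled" ;;   # any non-zero value means a balancing mode is on
    esac
}

numa_balancing_status
```

As root, `sysctl -w kernel.numa_balancing=1` (or `=0`) toggles it at runtime, which makes before/after benchmarking straightforward.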
Tunings and observables
● /proc/zoneinfo
○ Sysctl vm.zone_reclaim_mode OR /proc/sys/vm/zone_reclaim_mode
○ /proc/sys/vm/min_unmapped_ratio
● /proc/meminfo
● /proc/vmstat
● Ftrace / Perf
● Cgroup hierarchy
○ Memory
● Per process:
○ /proc/<pid>/numa_maps
○ /proc/<pid>/sched
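The per-process numa_maps file lists, for every mapping, how many pages sit on each node as `N<node>=<pages>` tokens. The sketch below totals those tokens per node; the function name `numa_pages_per_node` is hypothetical, and the counts are in units of each mapping's page size, so mixed huge-page and regular mappings are not directly comparable.

```sh
# Sum the N<node>=<pages> tokens of a numa_maps file read from stdin
# and print one "N<node> <pages>" line per node.
numa_pages_per_node() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=[0-9]+$/) {
                split($i, kv, "=")
                pages[kv[1]] += kv[2]
            }
    }
    END { for (n in pages) print n, pages[n] }' | sort
}

# Example: inspect the current shell, if the file is available.
if [ -r /proc/self/numa_maps ]; then
    numa_pages_per_node < /proc/self/numa_maps
fi
```

For a running JVM, feeding it `/proc/<pid>/numa_maps` shows at a glance whether the heap and off-heap mappings ended up interleaved or concentrated on one node.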
NUMA statistics
Further
● http://frankdenneman.nl/2016/07/07/numa-deep-dive-part-1-uma-numa/
● http://queue.acm.org/detail.cfm?id=2852078
● https://plumbr.eu/java-garbage-collection-handbook
● http://mechanical-sympathy.blogspot.in/2013/07/java-garbage-collection-distilled.html
Credits!
● http://queue.acm.org/detail.cfm?id=2513149
● www.linux-kvm.org/images/7/75/01x07b-NumaAutobalancing.pdf
● http://events.linuxfoundation.org/sites/events/files/slides/Normal%20and%20Exotic%20use%20cases%20for%20NUMA%20features.pdf
● https://en.wikipedia.org/wiki/Non-uniform_memory_access
● https://lihz1990.gitbooks.io/transoflptg/content/02.%E7%9B%91%E6%8E%A7%E5%92%8C%E5%8E%8B%E6%B5%8B%E5%B7%A5%E5%85%B7/sample-output-of-the-numastat-command.png