RocksDB Storage Engine
Igor Canadi | Facebook
Overview
•  Story of RocksDB
•  Architecture
•  Performance tuning
•  Next steps
Story of RocksDB
Pre-2011
•  FB infrastructure – many custom-built key-value stores
•  LevelDB released
Experimentation (2011 – 2013)
•  First use-cases
•  LevelDB not designed for server workloads – many bottlenecks, stalls
•  Optimization
•  New features
Explosion (2013 – 2015)
•  Open sourced RocksDB
•  Big success within Facebook
•  External traction – LinkedIn, Yahoo, CockroachDB, …
New Challenges (2015 – )
•  Bring RocksDB to databases
MongoRocks
•  Running in production at Parse for 6 months
•  Huge storage savings (5TB → 285GB)
•  Document-level locking
MyRocks
[Chart: Database size (relative), InnoDB vs. RocksDB]
[Chart: Bytes written (relative), InnoDB vs. RocksDB]
Architecture
Log Structured Merge Trees
Memtable (64MB)
Level 0 (256MB)
Level 1 (512MB)
Level 2 (5GB)
Level 3 (50GB)
Level 4 (500GB)
Log Structured Merge Trees – write
(key,value) → Memtable (64MB)
Level 0 (256MB)
Log Structured Merge Trees – flush
Memtable (64MB) → Level 0 (256MB)
Log Structured Merge Trees – compaction
Level 2 (5GB) → Level 3 (50GB)
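Compaction can be pictured as merging a newer sorted run into an older one, with the newer value winning for any key present in both. The sketch below is a toy model under that assumption; the function name `compact` and the in-memory lists are hypothetical, and a real engine streams the merge over files rather than building a dict.

```python
def compact(newer_run, older_run):
    """Merge two sorted runs of (key, value); newer entries override older ones."""
    merged = dict(older_run)    # start from the older level's data
    merged.update(newer_run)    # newer level wins on duplicate keys
    return sorted(merged.items())  # output is again a sorted run


newer = [("b", "b2"), ("d", "d1")]               # e.g. data coming from Level 2
older = [("a", "a1"), ("b", "b1"), ("c", "c1")]  # e.g. data already in Level 3
result = compact(newer, older)
```

After the merge, the stale value for key `b` is gone, which is how compaction reclaims space.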
Writes
•  Foreground:
•  Writes go to memtable (skiplist) + write-ahead log
•  Background:
•  When memtable is full, we flush to Level 0
•  When a level is full, we run compaction
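The write path above can be sketched as a toy model: every write is appended to a log and applied to the memtable, and a full memtable is flushed as a sorted run into Level 0. The class `ToyLSM`, its entry-count limit, and the dict standing in for the skiplist are all assumptions made for illustration, not RocksDB's implementation.

```python
class ToyLSM:
    """Toy model of the write path: WAL append, memtable insert, flush to Level 0."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # entry count, standing in for a byte size
        self.wal = []        # append-only log of (key, value) records
        self.memtable = {}   # a dict stands in for the skiplist
        self.level0 = []     # immutable sorted runs, newest last

    def put(self, key, value):
        self.wal.append((key, value))  # foreground: log the write first
        self.memtable[key] = value     # foreground: then apply it to the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()               # done in the background in a real engine

    def flush(self):
        self.level0.append(sorted(self.memtable.items()))  # write a sorted run
        self.memtable = {}
        self.wal = []  # flushed data no longer needs the log


db = ToyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # memtable is full, so it is flushed to Level 0
db.put("c", 3)  # lands in a fresh memtable and a fresh log
```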
Reads
Memtable (64MB)
Level 0 (256MB)
Level 1 (512MB)
Level 2 (5GB)
Level 3 (50GB)
Level 4 (500GB)
Reads
•  Point queries
•  Bloom filters reduce reads from storage
•  Usually only 1 read IO
•  Range scans
•  Bloom filters don’t help
•  Depends on amount of memory, 1-2 IO
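To illustrate why Bloom filters cut storage reads for point queries, here is a minimal sketch: each table carries a small filter, and a lookup only "reads" a table whose filter says the key might be present. The `BloomFilter` class, the SHA-256 based hashing, and the dict-backed tables are assumptions chosen for brevity; RocksDB's actual filter format differs.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def make_table(items):
    """Build a (filter, data) pair; the dict stands in for a table file."""
    bloom = BloomFilter()
    for key in items:
        bloom.add(key)
    return bloom, dict(items)


def point_lookup(key, tables):
    """Search tables newest first; count how many tables are actually read."""
    reads = 0
    for bloom, data in tables:
        if not bloom.might_contain(key):
            continue  # filter rules the table out: no storage read
        reads += 1
        if key in data:
            return data[key], reads
    return None, reads


tables = [  # newest first
    make_table({"k3": "new"}),
    make_table({"k2": "mid"}),
    make_table({"k1": "old", "k3": "stale"}),
]
value, reads = point_lookup("k1", tables)
```

A range scan cannot use this shortcut, because a filter answers "is this exact key here?" and not "is any key in this range here?".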
RocksDB Files
rocksdb/> ls
MANIFEST-000032
000024.log
000031.log
000025.sst
000028.sst
000029.sst
000033.sst
000034.sst
LOG
LOG.old.1441234029851978
...
RocksDB Files – MANIFEST
(initial state)
Add file 1
Add file 2
Add file 3
Add file 4
…
(flush)
Add file 9
Mark log 6 persisted
(compaction)
Add file 10
Add file 11
Remove file 9
Remove file 8
Add new column family “system”
•  Atomic updates to database metadata
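One way to read the MANIFEST is as a log of metadata edits that is replayed on open to rebuild the current state. The sketch below assumes each record is a group of edits applied together; the operation names and the file numbers are hypothetical and do not reflect the on-disk encoding.

```python
def replay_manifest(records):
    """Replay metadata edit records in order and return the resulting state."""
    live_files = set()
    persisted_logs = set()
    column_families = {"default"}
    for record in records:         # each record is applied as one unit
        for op, arg in record:
            if op == "add_file":
                live_files.add(arg)
            elif op == "remove_file":
                live_files.discard(arg)
            elif op == "log_persisted":
                persisted_logs.add(arg)
            elif op == "add_column_family":
                column_families.add(arg)
            else:
                raise ValueError(f"unknown manifest op: {op}")
    return live_files, persisted_logs, column_families


records = [
    [("add_file", 1), ("add_file", 2), ("add_file", 3)],         # initial state
    [("add_file", 4), ("log_persisted", 2)],                     # flush
    [("add_file", 5), ("remove_file", 3), ("remove_file", 4)],   # compaction
    [("add_column_family", "system")],
]
live_files, persisted_logs, column_families = replay_manifest(records)
```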
RocksDB Files – Write-ahead log
Write(A, B)
Write(C, D)
Write(E, F)
Delete(A)
Write(X, Y)
Delete(C)
•  Persisted memtable state
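Since the log is the persisted form of the memtable, recovery amounts to replaying it. A minimal sketch, assuming the records apply in the order they are listed on the slide and that a delete is stored as a tombstone marker (the `TOMBSTONE` sentinel and tuple layout are hypothetical):

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"


def replay_wal(records):
    """Rebuild memtable contents from write-ahead log records, oldest first."""
    memtable = {}
    for op, key, value in records:
        if op == "write":
            memtable[key] = value
        elif op == "delete":
            memtable[key] = TOMBSTONE  # deletes are recorded, not applied in place
        else:
            raise ValueError(f"unknown WAL op: {op}")
    return memtable


wal = [
    ("write", "A", "B"),
    ("write", "C", "D"),
    ("write", "E", "F"),
    ("delete", "A", None),
    ("write", "X", "Y"),
    ("delete", "C", None),
]
memtable = replay_wal(wal)
```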
RocksDB Files – Table files
Data blocks: <key, value>
•  compressed
•  prefix encoded
Index block: <key, block>
Filter block
Statistics
Meta index block: pointers to blocks
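Prefix encoding works well on sorted keys because neighbours tend to share a long prefix. The sketch below stores, for each key, the number of characters shared with the previous key plus the remaining suffix. It is an illustration only: the function names are hypothetical, and the real block format works on bytes and also keeps periodic restart points with full keys.

```python
def prefix_encode(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    encoded = []
    prev = ""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def prefix_decode(encoded):
    """Invert prefix_encode by rebuilding each key from the previous one."""
    keys = []
    prev = ""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


keys = ["user1000", "user1001", "user2000"]
encoded = prefix_encode(keys)
```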
RocksDB Files – LOG files
•  Debugging output
•  Tuning options
•  Information about flushes and compactions
•  Performance statistics
Backups
•  Table files are immutable
•  Other files are append-only
•  Easy and fast incremental backups
•  Open sourced Rocks-Strata
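Because table files never change once written, an incremental backup only needs to copy table files it has not seen before. A minimal planning sketch under that assumption follows; the function name is hypothetical, other files are simply re-copied here for simplicity, and this is not a description of how Rocks-Strata works.

```python
def plan_incremental_backup(db_files, already_backed_up):
    """Return the files to copy: new table files plus all non-table files."""
    plan = []
    for name in sorted(db_files):
        if name.endswith(".sst") and name in already_backed_up:
            continue  # immutable and already in the backup: skip
        plan.append(name)
    return plan


db_files = {"MANIFEST-000032", "000031.log", "000025.sst", "000028.sst", "000033.sst"}
already_backed_up = {"MANIFEST-000032", "000025.sst", "000028.sst"}
plan = plan_incremental_backup(db_files, already_backed_up)
```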
Performance tuning
Tombstones
•  Deletions are deferred
•  May cause higher P99 latencies
•  Be careful with pathological workloads, e.g. queues
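The queue case can be sketched as follows: items are consumed from the front by deleting them, so a scan for the next live item has to step over every tombstone that compaction has not yet removed. The data layout and the `TOMBSTONE` sentinel are assumptions for illustration.

```python
TOMBSTONE = object()  # marker left behind by a delete until compaction drops it


def first_live_key(entries):
    """Scan sorted (key, value) entries; return the first live key and tombstones skipped."""
    skipped = 0
    for key, value in entries:
        if value is TOMBSTONE:
            skipped += 1
            continue
        return key, skipped
    return None, skipped


# A queue of 1000 jobs where the first 990 were already consumed (deleted).
entries = [(i, f"job-{i}") for i in range(1000)]
for i in range(990):
    entries[i] = (i, TOMBSTONE)

key, skipped = first_live_key(entries)
```

Finding one live item cost 990 skipped entries, which is the kind of work that shows up as tail latency.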
Caching
Block cache
•  Managed by RocksDB
•  Uncompressed data
•  Defaults to 1/3 of RAM
Page cache
•  Managed by kernel
•  Compressed data
Memory usage
•  Block cache
•  Index and filter blocks (0.5 – 2% of the database)
•  Memtables
•  Blocks pinned by iterators
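A rough memory budget can be estimated by adding the components listed above. The sketch below is back-of-the-envelope arithmetic with made-up example numbers; blocks pinned by iterators are left out because they depend on the workload, and the function name and parameters are hypothetical.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def estimate_memory_bytes(block_cache, db_size, index_filter_fraction,
                          memtable_size, num_memtables):
    """Sum block cache, index/filter blocks, and memtables (pinned blocks excluded)."""
    index_and_filters = db_size * index_filter_fraction  # 0.5 - 2% of the database
    memtables = memtable_size * num_memtables
    return block_cache + index_and_filters + memtables


total = estimate_memory_bytes(
    block_cache=16 * GiB,        # example value
    db_size=500 * GiB,           # example value
    index_filter_fraction=0.01,  # 1%, inside the 0.5 - 2% range
    memtable_size=64 * MiB,
    num_memtables=2,             # example value
)
```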
Reduce memory usage
•  Reduce block cache size – will increase CPU usage
•  Increase block size – decreases index size
•  Turn off bloom filters on bottom level
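The block size point follows from the index holding roughly one entry per data block. A small arithmetic sketch, assuming exactly one index entry per block and ignoring compression (the function name and sizes are illustrative):

```python
def index_entries(data_bytes, block_size):
    """Number of index entries, assuming one entry per data block."""
    return -(-data_bytes // block_size)  # ceiling division


GiB = 1024 ** 3
KiB = 1024

small_blocks = index_entries(1 * GiB, 4 * KiB)
large_blocks = index_entries(1 * GiB, 16 * KiB)
```

Quadrupling the block size cuts the entry count to a quarter, at the cost of reading more data per block fetch.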
Reduce CPU
•  Profile the CPU usage
•  Increase block cache size – will increase memory usage
•  Turn off compression
•  It might be tombstones
Reduce write amplification
•  Write amplification = 5 * num_levels
•  Increase memtable and level 1 size
•  Stronger (zlib, zstd) compression for bottom levels
•  Try universal compaction
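The rule of thumb above suggests why a larger level 1 helps: fewer levels are needed to hold the same database. The sketch below applies the slide's formula under two stated assumptions: each level is 10x the previous one, and only levels from level 1 down are counted. Function names and example sizes are hypothetical.

```python
def levels_needed(db_size, level1_size, multiplier=10):
    """Levels (from level 1 down) until the last level can hold db_size."""
    levels = 1
    capacity = level1_size
    while capacity < db_size:
        capacity *= multiplier  # assumed size ratio between adjacent levels
        levels += 1
    return levels


def write_amplification(num_levels, per_level_factor=5):
    """Rule of thumb from the slide: 5 * num_levels."""
    return per_level_factor * num_levels


# Sizes in MB: a 500GB database with a 512MB vs. a 5GB level 1.
wa_small_l1 = write_amplification(levels_needed(500_000, 512))
wa_large_l1 = write_amplification(levels_needed(500_000, 5_120))
```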
Next steps
•  Increase performance & stability
•  Deploy MyRocks at Facebook
•  External adoption of MyRocks and MongoRocks
•  Build an ecosystem
Thank you

RocksDB storage engine for MySQL and MongoDB