Ruby for
Distributed Storage Systems
RubyKaigi 2017: Sep 20, 2017
Satoshi Tagomori (@tagomoris)
Treasure Data, Inc.
Satoshi Tagomori (@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, Woothee, ...
Treasure Data, Inc.
Ruby for
Distributed Storage Systems
Ruby for and
Distributed Storage Systems
Ruby and Performance
• Web? or Not?
• Disk & Network I/O
• "I/O takes most of the time on servers"... is it real?
• Storage is getting faster and faster (SSD, NVMe, ...)
• Networks too (10GbE, fast network in Cloud, ...)
Storage Systems
• Disk I/O
• Network I/O
• Serialization / Deserialization (json, msgpack, ...)
• read/write data from/to disk
• parse/generate HTTP request/response
• Indexing (update, search)
• Timer
• Threads + Locks
Distributed Storage Systems
• Data replication
• Checksum
• Asynchronous network I/O
• Quorum
• More Threads + Locks
Replication w/ 3 replicas
• Create 3 replicas of data, including local storage
(diagram: accept request to write data → (1) write the data into local storage → send requests to replicate data → (3) receive responses to replicate data → send response to write data)
Replication in Quorum Systems: In Action
• Create at least 2 replicas of data (max 3), including local storage
(diagram: (1) accept request to write data and write it locally → create 2 threads to send requests to replicate data → (2) receive a successful response to replicate data → send response to write data, discarding the thread for another node)
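The quorum write above can be sketched with threads and a Queue. `replicate_to`, the node hashes, and all names below are illustrative, not Bigdam-pool's actual API:

```ruby
# A rough sketch: write locally first, then replicate to other nodes in
# parallel and respond as soon as enough replicas acknowledge.
def replicate_to(node, data)
  node[:up]   # stand-in for a network call; true means success
end

def quorum_write(data, local_store, nodes, required_remote: 1)
  local_store << data                        # (1) write it locally
  results = Queue.new
  nodes.each do |node|                       # create threads to replicate
    Thread.new { results << replicate_to(node, data) }
  end
  acks = 0
  nodes.size.times do
    acks += 1 if results.pop                 # (2) a successful response
    return true if acks >= required_remote   # quorum met: respond now;
  end                                        # remaining threads are discarded
  false
end

quorum_write("chunk-1", [], [{ up: true }, { up: false }])  # => true
```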
Bigdam
Bigdam
• Brand-new data ingestion pipeline in Treasure Data
• Huge data
• Extraordinarily large number of connections / requests
• Many edge endpoints on the planet
Bigdam: Edge locations on the earth + the Central location
Bigdam components
(diagram; @narittan, @tagomoris, @nalsh, @k0kubun, @komamitsu_tw)
Bigdam-pool
• OSS (in future... not yet)
• Distributed key-value storage
• for buffer pool in Bigdam
• to build an S3-free data ingestion pipeline
Bigdam-pool: Small Buffers
• Small buffers (MBs)
• Write: append support for many small chunks (KBs)
• Read: secondary index to query/read many buffers at once
• Short buffer lifetime: minutes (create - append - read - delete)
• Buffers store ids of chunks (for deduplication)
(diagram: buffers keyed by account_id, database, table, each holding many chunks)
Bigdam-pool: Replication
• Replication in a cluster
• without maintaining replica factor
• Clients send requests to all living nodes
Bigdam-pool: Buffer Transferring over Clusters
• Edge location → Central location, over the Internet, using HTTPS or HTTP/2
• Buffer committed (by size or timeout)
• Written in Java
Designing Bigdam
• Architecture Design - split the system into 5 microservices
• consistency, availability
• performance (how to scale it out?)
• deployment, cost
• API Design
• Mocking
• Interface Test
• Integration Test
Mocking Bigdam using Ruby
• Mocking
• build mock servers of all components
• implement all public APIs between components
• Find/add missing required parameters
• Prepare to develop components in parallel
• Mocked using Ruby, Sinatra
• public APIs - it's just a Webapp
• fast and easy to do :D
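Mock servers reduce to "it's just a webapp": a Rack-style object mapping each public API to a canned response. The endpoint path and payload below are made up for illustration (the real mocks were written with Sinatra):

```ruby
require "json"

# A Rack-style mock of one component: respond to `call(env)` with
# [status, headers, body], returning canned data for each public API.
class MockPool
  def call(env)
    case [env["REQUEST_METHOD"], env["PATH_INFO"]]
    when ["POST", "/buffers"]
      [200, { "Content-Type" => "application/json" },
       [JSON.generate("buffer_id" => "dummy-1")]]
    else
      [404, {}, ["not found"]]
    end
  end
end

status, _headers, body = MockPool.new.call(
  "REQUEST_METHOD" => "POST", "PATH_INFO" => "/buffers"
)
# status == 200, body carries the dummy buffer_id
```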
Interface/Integration Tests of Bigdam using Ruby
• Interface tests:
• verify all public APIs are implemented correctly
• Integration tests
• verify the whole pipeline can import data correctly
• Written in Ruby, test-unit
• less code to serialize/deserialize various req/res
• readable test cases
• fast and easy to do :D
And,
Bigdam-pool-ruby
• Port bigdam-pool from Java to Ruby
• An experiment to see whether Ruby is good enough
Bigdam-pool-ruby
• Fully compatible with the Java implementation
• Public API, Private API
• Data formats on local storage and of the secondary index
• Under development
• only supports standalone mode for now
Studies: Serialization / Deserialization
• Every network API call requires it
• parsing HTTP request
• parsing request content body (json/msgpack)
• building response content body (json/msgpack)
• building HTTP response
• Should be parallelized on CPU cores
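The per-request hot path can be sketched with stdlib JSON (field names are made up; the real system also speaks MessagePack):

```ruby
require "json"

# Parse the request body, do a little work, build the response body:
# this happens on every single API call.
request_body = '{"database":"db1","table":"t1","chunk_id":"abc"}'
record = JSON.parse(request_body)        # deserialization
record["accepted"] = true
response_body = JSON.generate(record)    # serialization
```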
Studies: Asynchronous Network I/O
• EventMachine? Cool.io? Celluloid::IO?
• 🤔
• I want async I/O only for the network (not disk, not timers)!
• Event driven I/O library?
• Thread pools + callback?
• or any idea?
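Whichever library wins, the stdlib primitive underneath event-driven network I/O is IO.select; a toy single-shot example over a pipe (sockets work the same way):

```ruby
# Wait for an IO to become readable without blocking on read itself.
r, w = IO.pipe
w.write("hello")
readable, _writable, _errored = IO.select([r], nil, nil, 1)  # wait up to 1s
data = readable.first.read_nonblock(1024) if readable
w.close
r.close
# data == "hello"
```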
Threading / Timers
• ExecutorService in Java is very useful...
• Fixed / non-fixed thread pools with Queue
• (and some other executor models)
• Runner of Runnable tasks
• "Runnable task" is just like a lambda w/o args
• To be implemented as Gem?
• Queue and SizedQueue look useful for it
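A minimal fixed-size pool in the spirit of Java's Executors.newFixedThreadPool, built only on Queue as suggested above; a sketch, not a production executor:

```ruby
# Workers pop "Runnable tasks" (lambdas w/o args) from a shared Queue.
class FixedThreadPool
  def initialize(size)
    @tasks = Queue.new
    @threads = Array.new(size) do
      Thread.new do
        while (task = @tasks.pop)   # nil is the shutdown signal
          task.call
        end
      end
    end
  end

  def submit(&task)
    @tasks << task
  end

  def shutdown
    @threads.size.times { @tasks << nil }   # one poison pill per worker
    @threads.each(&:join)
  end
end

pool = FixedThreadPool.new(4)
results = Queue.new
10.times { |i| pool.submit { results << i * i } }
pool.shutdown   # all 10 tasks have run here
```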
Queue#peek
• Get the head object w/o removing it from the queue
• https://github.com/ruby/ruby/pull/1698
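Queue#peek does not exist yet (the PR above proposes it); a small wrapper class sketching the intended behavior:

```ruby
# Read the head object while leaving it in the queue, blocking if empty.
class PeekableQueue
  def initialize
    @items = []
    @lock  = Mutex.new
    @cond  = ConditionVariable.new
  end

  def push(obj)
    @lock.synchronize { @items << obj; @cond.signal }
  end

  def pop
    @lock.synchronize do
      @cond.wait(@lock) while @items.empty?
      @items.shift
    end
  end

  def peek
    @lock.synchronize do
      @cond.wait(@lock) while @items.empty?
      @items.first   # the head stays in the queue
    end
  end
end
```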
MonitorMixin#mon_locked? and #mon_owned?
• Mutex#owned? exists
https://github.com/ruby/ruby/pull/1699
Resource Control
Make sure to release resources: try-with-resources in Java
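Ruby's counterpart to try-with-resources is a method that yields the resource and releases it in an ensure block, as File.open already does; a sketch with a hypothetical connection class:

```ruby
# `Conn` is a made-up connection class for illustration only.
class Conn
  attr_reader :closed
  def initialize; @closed = false; end
  def close;      @closed = true;  end

  def self.open
    conn = new
    begin
      yield conn
    ensure
      conn.close   # released even if the block raises
    end
  end
end

used = nil
Conn.open { |c| used = c }
used.closed   # => true
```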
Typing?
• Defining APIs
• Rubyists (including me) MAY be using: [string, integer, boolean, string, ...]
• Rubyists (including me) MAY be using: {"time": unix_time (but sometimes float)}
• Explicit definitions do no harm when designing APIs
• JSON Schema or something similar may help us...
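Until a schema tool is adopted, even a tiny hand-rolled check makes the types explicit. The field names and schema below are illustrative, not Bigdam's actual API:

```ruby
# Declare expected types once, validate every incoming params hash.
SCHEMA = { "database" => String, "table" => String, "time" => Integer }

def validate!(params)
  SCHEMA.each do |key, type|
    value = params[key]
    raise TypeError, "#{key} must be a #{type}" unless value.is_a?(type)
  end
  params
end

validate!("database" => "db1", "table" => "t1", "time" => 1505865600)  # ok
# validate!("database" => "db1", "table" => "t1", "time" => 3.14)
#   would raise TypeError: the "sometimes float" case above
```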
Typing: in logging and others
https://bugs.ruby-lang.org/issues/13913
Process Built-in Application Servers
• Distributed Storage Systems:
• Background worker threads
• Timers
• Communication workers to other nodes
• Various async operation workers
• Public API request handlers
• Private API request handlers (inter-nodes)
• Startup/Shutdown hooks
• It's NOT just a web application, but it handles HTTP requests
https://github.com/tagomoris/bigdam-pool-ruby
NOT YET
"Why Do You Want to Write Such Code in Ruby?"
"Because I WANT TO DO IT!"
"... And we already have Fluentd :P"
Thank you.
@tagomoris
