Large-scale projects development
Alexey Rybak, Badoo

DevConf, 10 June 2012
Who am I?
• developer/manager/director roles in several companies:
  2005 – present; 2004 – 2005; and others, 1999 – 2004
• this tutorial – a hobby educational project since 2006
Rate yourself, please
•   Worked primarily on one-server or
    shared hosting systems, want to know
    basics of large-scale architectures and
    scaling techniques
•   Already have several servers in
    production, want to know how to grow on
•   Know all the things more or less, just
    want to systematize my knowledge and get
    answers to particular questions
A few more introductory words

• Technology stack – LAMP
• Most problems are fundamental and
  stack-independent in nature
• Interrupt, ask questions
• Is flipchart visible? We will have
  several flipchart sessions
Tutorial schedule
•   Introduction: values & principles
•   Web/applications and cache tiers
•   Databases, sharding
•   Queues
•   Lean production: measuring
•   Questions session (min. 1 hour)
1. Introduction: values and principles
Why values?
• the next message is for developers
• already worked on big projects? then you know this
• no? please keep an open mind
• something may sound wrong
• sad but true
In large-scale projects
• programming as writing code matters less
• system design is the key
• system design is not about
   • patterns
   • classes
   • modules
   • API …
   • not about any code-writing practice or
   code design
System design
• putting various components together
• software and hardware
• most components are “ready”
• know these components
• more engineering
• less traditional “programming”
System design
• focused on business values
• performance + cost of ownership
• more clients (requests) with less money invested
• operations with fewer resources, minimum downtime…
• performance, high availability, reliability,
recovery… and many other buzzwords
• can be painful for developers, as it’s about
managing unknowns
Scalability: an ability to grow

[Chart: $$$ (income) vs. $$$ (spending); three curves – linear with good performance, non-linear, and linear with bad performance]

• Scalability and performance determine your growth together
• Scalability is the class of the function
• Performance is a function parameter (here: the angle)
• Will talk about both scalability and performance
Scaling
• vertical: scale up (improving hardware)
• horizontal: scale out (adding boxes)
• component coupling matters
• the key to horizontal scaling is weak coupling
between subsystems (share nothing =
weak/loose coupling)
Queueing theory
• Just to introduce basic models
• Massive flow of random requests:
   • Telecommunications
   • call-centers
   • supermarkets
   • filling (gas) stations
   • airports
   • fast-food
   • Disneyland...
   • and internet projects
• Started by A. K. Erlang, «The Theory of
Probabilities and Telephone conversations»,
1909
Basic model: single-server queue

[Diagram: requests -> queue -> server -> processed requests; queue overflow = failure]

Characteristics:
• processed requests/sec (throughput)
• total processing time (latency)
• failures/sec (quality)
• many others

Important property: rapid non-linear performance degradation
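The degradation can be made concrete with the textbook M/M/1 single-server queue (a standard result, added here as an illustration, not from the slides): with arrival rate $\lambda$ and service rate $\mu$, the average time a request spends in the system is

$$ W = \frac{1}{\mu - \lambda} = \frac{1/\mu}{1 - \rho}, \qquad \rho = \frac{\lambda}{\mu} < 1 $$

At 50% utilization latency is only 2x the bare service time, at 90% it is 10x, and as $\rho \to 1$ it diverges; beyond that the queue overflows and requests fail.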
Multiple-server queue

[Diagram: requests -> one shared queue -> N servers -> processed requests]

• a queue + N servers performs better than N × (queue + server)
• find these models in your project – they form the basis of your architecture
System design
• Goal: components are coupled in the most
effective way
• Method: imagine it’s all queues and analyze data
processing flows
• Components
    • High-level (software)
    • Low-level (hardware)
High-level components
• Your software + ready building blocks
• “Ready” software:
   • web servers
   • application servers (can be incorporated
   into web)
   • cache servers
   • database servers
Each based on
• Hardware
   • CPU
   • memory
   • disk
   • network
• OS
   • Linux/UNIX parallelism
Hardware: data flow limits

[Diagram: CPU (< 1e-9 s, per-core caches) – memory and FS cache (1e-7 – 1e-6 s) – network (~1e-5 s) – HDD (> 1e-3 s)]

• HDD sequential: ~100 MB/sec
• HDD random: ~200 req/sec
• database IO isn’t sequential
• SSD rocks at random IO
• random reads from memory over the network are faster than using a local disk
Hardware: conclusions
• reading from another box’s memory can be
significantly faster than reading from a local disk
• weakest link: random HDD IO (databases)
• sequential bulk reads/writes are more effective
• batch writes: accumulate data in memory and
sync
• databases use a combination of these
techniques
• battery-backed write cache
• SSD: much faster random access
Components splitting

[Diagram: incoming HTTP traffic -> front-end (connection handling) -> back-end (application cluster) -> cache (fast memory storage) -> sharded databases (split disk writes); other application clusters involved in request processing: queueing, jobs, analytical applications… Applications are covered in section 2, data in section 3]

In the next sections we’ll discuss
• why this splitting is effective
• how to scale the app/cache/db tiers horizontally
2. Web/applications tier
Why frontend and backend?

[Diagram: incoming HTTP traffic -> front-end (connection handling) -> back-end (application cluster)]

C10K problem – serving 10K concurrent connections
Need to know
• OS parallelism
• server models
Linux: parallelism
• processes
• threads
• multitasking, interrupts: context switch
• the key property is how servers
handle network connections
Server models

• Process per connection
• Thread per connection
• FSM (finite state machine)
Connection handling
•   process-per-connection (apache 1, 2 mpm_prefork)
•   slow clients = many processes
•   thread-per-connection (apache 2 mpm_worker)
•   slow clients = many threads
•   Keep-Alive – 90% of clients
•   Overhead: context switches, RAM
•   “lightweight“: nginx (engine-x), lighttpd (lighty), …
Server models
• Process per connection
  • CGI: fork per connection
  • Pooling: Apache (v.1, mpm_prefork – min, max,
  spare), PostgreSQL+pgpool, PHP-FPM …
• Thread per connection
  • Pooling: Apache (mpm_worker – min, max, spare),
  MySQL(thread_cache)
• FSM (finite state machine)
  • “modern” kernel: kqueue, epoll
  • interface: libevent, libev
  • FSM + process pooling: nginx
  • FSM + thread pooling: memcached v>1.4
Nginx
•   1 master + N workers (10**3 – 10**4 conn)
•   N ~ CPU cores * (blocking IO probability)
•   FSM
•   maniacal attention to speed and code quality
•   Keep-Alive: 100Kbytes active / 250 bytes inactive
•   logical, flexible, scalable configuration
•   even has embedded (stripped-down) Perl
•   nginx.com
[front/back]end
• What does web-server do?
   • Executes script code
   • Serves client
• Hey, does a cook talk to restaurant
customers?
• These tasks are different, so split them into a
frontend and a backend
• nginx + Apache with mod_php, mod_perl,
mod_python
• nginx + FCGI (for example, php-fpm)
[front/back]end

[Diagram: «fast» and «slow» clients -> light-weight server (LWS): nginx, serving static content and simple scripting (SSI, perl) -> heavy-weight server (HWS): Apache (mod_php, mod_perl, mod_python) or FastCGI, serving dynamic content]
[front/back]end: scaling

[Diagram: software load balancer (SLB) -> frontends (F) -> backends (B)]

• homogeneous tiers (maintenance)
• round-robin balancing (weighted, WRRB)
• WRRB means there’s no “state”
• key to the simplest horizontal scaling:
   • don’t store any “state” on the box
   • weak coupling
Scaling

[Chart: income vs. spending – the goal is linear growth with good performance]
Scaling web tier
• Many servers – put front- and back-ends into one
  box (much simpler maintenance)
• Don’t store state on these boxes
• Loose coupling
• any shared resource makes boxes “coupled”
• share carefully
• Common errors (and fixes; see the sketch below)
– common data via NFS (sessions, code) => local
  copies, sessions in memcached
– heavy real-time writes into a shared db => if possible,
  async messages
– local cache => global cache
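A minimal sketch of the “sessions in memcached” fix, assuming the pecl/memcached extension with its bundled session handler (host names are made up):

<?php
// keep sessions in a shared memcached pool instead of local files or NFS,
// so any backend box can serve the user's next request
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', 'cache1.lan:11211,cache2.lan:11211'); // hypothetical hosts
session_start();
$_SESSION['uid'] = 42;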
nginx: load balancing

upstream   backend {
  server   backend1.example.com weight=5;
  server   backend2.example.com:8080;
  server   unix:/tmp/backend3;
}

server {
  location / {
     proxy_pass http://backend;
  }
}
nginx: fastcgi
upstream backend {
  server www1.lan:8080 weight=2;
  server www2.lan:8080;
}
server {
  location / {
     fastcgi_index index.phtml;
     fastcgi_param [param] [value];
     ...
     fastcgi_pass backend;
  }
}
Protected static files performance
• static files with restricted access
• you need some “logic” to check access rights
• scripting is expensive: “heavy” process for each
client
• X-Accel-Redirect: a “heavy” process checks the rights
quickly and returns a special header with the file name
(see the sketch below)
• signed URLs (“URL-certificates”): best practice, no scripting at all
• http://wiki.nginx.org/NginxHttpAccessKeyModule
• http://wiki.nginx.org/HttpSecureLinkModule
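A sketch of the X-Accel-Redirect approach, assuming an nginx location marked internal and a hypothetical user_can_download() check:

<?php
// nginx side (assumed):  location /protected/ { internal; alias /storage/; }
$file = basename($_GET['file']);            // never trust the raw path
if (!user_can_download($file)) {            // hypothetical access check
    header('HTTP/1.1 403 Forbidden');
    exit;
}
header('Content-Type: application/octet-stream');
header('X-Accel-Redirect: /protected/' . $file);
// PHP does no file IO at all – nginx streams the file to the client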
Caching
• «memory» – 1e-9…1e-6 s, «network» – 1e-4 s, «disk» – slower, 1e-3 s
• 100% static (pages, images etc), HTML blocks,
  «objects»
• Complexity:
   – if-modified-since (no request)
   – proxy cache (cache data is stored on a web server)
   – object (serialized) cache (a cache storage is used)
• Industry standard – memcached, also popular: Redis
  (more than a cache) and others
Local vs. Global cache
• memory utilization (very bad for huge clusters)
• incoherence
• intranet latency is small, use a global in-memory cache

[Diagram: frontend -> backends, each with its own local cache (LC) plus data; with a global cache, each backend talks to all global cache servers]
Memcached
• danga.com/memcached/ (LiveJournal -> Facebook)
• shared cache server
• FSM (libevent)
• memory slabs, items of 2^N size
• ideal for sessions, object cache
• performance tips (multi-get sketch below):
    • small objects, zip the rest (CPU cost? use size thresholds)
    • multi-get
    • stats (get, set, hit, miss + slab info)
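A sketch of the multi-get tip with the pecl/memcached client (host names and the DB helper are assumptions):

<?php
$mc = new Memcached();
$mc->addServers(array(array('cache1.lan', 11211), array('cache2.lan', 11211)));

$keys = array();
foreach ($user_ids as $id) {
    $keys[] = "user_$id";
}
$found   = $mc->getMulti($keys);                    // one round trip instead of N gets
$missing = array_diff($keys, array_keys($found));
foreach (load_users_from_db($missing) as $key => $user) {   // hypothetical helper
    $mc->set($key, $user, 300);                     // warm the cache, 5-minute TTL
}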
Scaling cache
• global cache: how to map a key to a server? (sketch below)
• server = crc32(key) % N and variations
• problem when adding a new server: ~100% miss (cold start)
• solutions
    • 1. don’t use complex queries, flush the caches
    periodically to check that your cold start is still quick
    (Badoo: the cache cluster is flushed several times per year)
    • 2. distribution tricks like Ketama (consistent hashing)
• years in production: old (slow) and new (fast) boxes
    • several daemons on one machine
    • virtual buckets
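A sketch of the naive crc32(key) % N mapping discussed above; changing N remaps almost every key (the cold-start problem), which is exactly what Ketama-style consistent hashing or virtual buckets avoid:

<?php
$servers = array('cache1.lan:11211', 'cache2.lan:11211', 'cache3.lan:11211'); // hypothetical

function server_for_key($key, array $servers)
{
    $hash = (int) sprintf('%u', crc32($key));   // force an unsigned value on 32-bit PHP
    return $servers[$hash % count($servers)];
}

echo server_for_key('user_42_profile', $servers), "\n";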
Advanced topic (PHP-only)
• can skip
• will be useful for PHP-developers only
• covers PHP-FPM, initially developed
in Badoo
• 6 slides, cover or skip?
PHP
• use acceleration: APC, xcache, ZPS,
eAccelerator
• PHP is quite hungry for memory & CPU
   • C: ~1 MB
   • Perl: ~10 MB
   • PHP: ~20 MB
• FCGI (fpm)
PHP-FPM
• PHP-FPM: PHP FastCGI process manager
• server architecture close to nginx (master + N workers)
• requirements for a happy production:
    • non-stop live binary upgrades and configuration
    • see all errors
    • react on suspicious worker behavior (latency, mass
    death)
    • dynamic pools (mostly useful for shared hosting)
PHP-FPM: basic features
• graceful reload: live binaries & conf updates
• master process catches workers stderr – you’ll see
  everything in logs
• slow workers auto-tracing & killing
• emergency auto-reload when a massive worker crash is
  detected
PHP-FPM: advanced features
• fatal blank page: the response status will NOT be 200 on fatals
• fastcgi_finish_request() – hand the output to the client and
keep working (sessions, stats etc.); see the sketch below
• accelerated upload support (request_body_file - nginx
0.5.9+)
• groups: highload-php-(en|ru)@googlegroups.com
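A sketch of fastcgi_finish_request() in use (render_page() and the two background jobs are hypothetical):

<?php
echo render_page();          // hypothetical: build and print the response
fastcgi_finish_request();    // the client gets the answer right here (PHP-FPM only)

// everything below runs after the connection is closed
session_write_close();
write_request_stats();       // hypothetical: push counters/timers
send_pending_emails();       // hypothetical slow job the user shouldn't wait for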
flipchart session

• Questions?
• Case#1: knowledge base (like wikipedia)
• Case#2: media-storage (photo-video-
  hosting, file-sharing etc)
3. Databases, sharding
Imagine you are… a database 
• and you’re doing SELECT
• rough approximation
• establish connection, allocate resources (speed,
memory-per-connection on server side)
• read the query
• check query cache (if enabled, memory,
invalidation)
• cont. on the next slide …
SELECT (cont.)
• parse query (CPU, bind vars, stored procs)
• “get data” (index lookup, buffer cache, disk
  reads)
• “sort data” (or just read sorted!)
• in-memory, filesort, key buffer
• output, clean up, close conn…
SELECT: resume
• many steps and details
• every step uses some “resource”
• the principal feature of relational databases
  was that you just need to know SQL to talk to
  them
• bad news: we have to know much more to
  tune databases
So, MySQL performance (1/3)
• Many engines - MyISAM, InnoDB,
Memory(Heap); Pluggable
• Locking: MyISAM table-level, InnoDB row-level
• «manual» locks: select get_lock, select for
update
• Indices: B-TREE, HASH (no BITMAP)
• point->rangescan->fullscan;
• fully matching prefix; innoDB PK: clustering,
coverage(“using index”);
• disk fragmentation
MySQL performance (2/3)
• myisam key cache, innodb buffer pool
• dirty buffers and transaction logs:
innodb_flush_log_at_trx_commit
• many indexes – heavy updates
• sorting: in-memory (sort buffers), filesort
MySQL performance (3/3)
• USE EXPLAIN
• Extra: using temporary, using filesort
• innodb_flush_method = O_DIRECT
• ALTERs can be heavy: use many small tables instead of
one big one
• partitioning
MySQL common practices
• applications: OLAP and OLTP
• OLAP – MyISAM (Infobright and other column-
based)
• OLTP – InnoDB
• imagine you are the database
• what operations will be executed?
• do you need all of them?
• replace heavy operations with lighter ones
• don’t be afraid of denormalization
• think about scaling from the very beginning
Denormalization
• remove extra join
• remove sorting
• remove grouping
• remove filtering
• make materialized views
• very many other things …
• Examples
    • Counters
    • Trees in databases: materialized path
    • Inverted search index
Other tips and tricks
•   multi-row operations
•   INSERT … ON DUPLICATE KEY UPDATE (sketch below)
•   table switching (RENAME TABLE)
•   MEMORY tables as temporary storage
•   updated = updated
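A sketch combining the counter, multi-operation and ON DUPLICATE KEY UPDATE tips above; the user_counters table and the $mysqli handle are assumptions:

<?php
// user_counters(user_id INT PRIMARY KEY, msg_count INT NOT NULL) -- assumed schema
$batch  = array(42 => 3, 77 => 1, 93 => 5);   // user_id => new messages in this batch
$values = array();
foreach ($batch as $uid => $n) {
    $values[] = sprintf('(%d, %d)', $uid, $n);
}
$sql = 'INSERT INTO user_counters (user_id, msg_count) VALUES ' . implode(',', $values) . '
        ON DUPLICATE KEY UPDATE msg_count = msg_count + VALUES(msg_count)';
$mysqli->query($sql);   // one round trip instead of one UPDATE per user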
Scaling databases
• we want
    • linear scalability
    • easy support
• many people start with replication
• replication is not bad, but it’s limited
• the only “true” scale-out solution is sharding
Scaling databases
• vertical splitting: by tasks (tables)
• put tables used together on another box
• horizontal: by primary entities (users,
documents)
• split one table into many small ones and move them
to other boxes
Replication basics
• single server, writes/reads << 1
• adding a new one gives more read capacity
• in the beginning ~100% growth (nearly linear)
• writes still go to the master; writes are not
  scaled
• more servers – less efficiency
• higher writes/reads factor – less efficiency (see the
  model below)
• social networks, UGC – many writes
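A back-of-the-envelope model (an illustration, not from the slides) shows why: if a unit of traffic carries R reads and W writes, every replica must apply all writes while reads are spread over N servers, so the per-server load is $W + R/N$ and the speedup over a single box is

$$ S(N) = \frac{R + W}{W + R/N} = \frac{N(R+W)}{NW + R}, \qquad \lim_{N\to\infty} S(N) = 1 + \frac{R}{W} $$

The ceiling $1 + R/W$ is exactly why a high writes/reads factor kills replication-based scaling.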
Replication problems
• close to linear only in the very beginning
• copies: ineffective disk and memory
(buffer pool, fs cache) utilization
• MySQL particularities: overhead of serving slaves,
replication applied by a single thread, etc.
[Chart: replication scaling vs. linear scaling; the gap G is 1) bigger for heavier writes, 2) bigger for write-intensive applications]
Scaling

[Chart: income vs. spending, repeated – the goal is linear growth with good performance]
Sharding
• spread writes across all database nodes and achieve
true scale-out
• which attribute to shard by?
• how to map data to a shard?
• how to keep keys unique across the whole system?
• how to query data from multiple nodes? how to run
analytical queries?
• how to re-shard?
• how to back up?
Mapping data to a shard
• primary attribute: user_id, document_id …
• unmanaged: id -> hash % N -> server
• better: virtual buckets (sketch below)
• id -> hash % N -> bucket -> [C] -> server
• buckets: user -> bucket is determined by a formula
• best, “dynamic”: user -> bucket is itself configurable
• “dynamic”: id -> [C1] -> bucket -> [C2] -> server
• configuration: C1 – “dynamic”, C2 – almost static
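A sketch of the virtual-bucket mapping: the id -> bucket step is a fixed formula, only the small bucket -> server table [C] changes on re-balancing (bucket count and host names are assumptions):

<?php
define('N_BUCKETS', 1024);                      // fixed once and for all (assumed)

$bucket_to_server = array(                      // the [C] table, hypothetical contents
    0 => 'db1.lan', 256 => 'db2.lan', 512 => 'db3.lan', 768 => 'db4.lan',
);

function bucket_for($user_id)
{
    return $user_id % N_BUCKETS;                // or a crc32/md5-based hash of the key
}

function server_for($user_id, array $map)
{
    $bucket = bucket_for($user_id);
    $server = null;
    foreach ($map as $first_bucket => $host) {  // keys must be sorted ascending
        if ($bucket >= $first_bucket) {
            $server = $host;
        }
    }
    return $server;
}

echo server_for(123456, $bucket_to_server), "\n";   // bucket 576 -> db3.lan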
Sharding topology
• Two main patterns:
    – proxy: hides sharding logic
    – coordinator: just tells exactly where to go
• proxy
    • harder to build from scratch
    • easy to write apps
• coordinator
    • easier to build from scratch
    • relatively harder to use
    • architecture doesn’t hide anything and provokes
       developers to learn internals
Dynamic mapping
• ID -> {map 1} -> bucket -> {map 2} -> server
• “coordinates”
    • datacenter
    • server
    • schema
    • table
• mapping:
    • ID -> {bucket}
    • {bucket} = {server, schema, table}
    • 42 = {db15.dc3, Shard7, User33}
    • 42 = {30015, 7, 33}
    • almost “static” (changes rarely: re-sharding)
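A sketch of the bucket -> physical coordinates lookup from the example above (42 = {30015, 7, 33}); helper names are assumptions:

<?php
$coordinates = array(
    // bucket => array(server_id, schema_id, table_id)
    42 => array(30015, 7, 33),
);

function locate($bucket, array $coordinates)
{
    list($server, $schema, $table) = $coordinates[$bucket];
    return array(
        'server' => $server,            // resolved to a host such as db15.dc3 via config
        'schema' => "Shard$schema",     // Shard7
        'table'  => "User$table",       // User33
    );
}

print_r(locate(42, $coordinates));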
Dynamic mapping

[Diagram: the WebApp asks the Coordinator “Where?”, gets back “Node #1234”, then reads/writes the data directly on the storage nodes]
Case#3: Sharding
• flipchart!
• most difficult part of tutorial
• don’t hesitate to ask questions
• additional questions to answer:
     • how to query data from multiple nodes?
     • how to run analytical queries?
     • how to re-shard?
     • how to back-up?
MySQL in Badoo (1/3)
• minus in theory – plus in practice
• they say MySQL is “stupid”
• while this usually means that
   – MySQL doesn’t allow complex dependencies
   – so MySQL just doesn’t dictate ineffective
     architecture
   – no rocket science to build a system for millions of
     users and thousands of boxes on commodity servers
MySQL in Badoo (2/3)
• InnoDB
• avoid complex queries
• no FK, triggers or procedures
• homemade sharding, replication, upgrade
  automation
• virtual coordinate shard_id mapped to physical
  coordinates {serverX, dbY, tableZ}
MySQL in Badoo (3/3)

• no “transparent” proxies that “hide” architecture
• clients are routed dynamically
• queues – MySQL (transaction-based events), also
  used Scribe, RabbitMQ
• the architecture hasn’t changed in 6 years, from 0 to
  130 M users
4. Queues
Queues

• If we can do something later – client shouldn’t wait
• While sharding is “separation in space”, queueing
  is “separation in time”
• Will cover basics and show how to build such a
  component
Distributed communications

•   RPC = Remote procedure calls
•   MQ = message queues
•   Synchronous: remote services
•   Asynchronous: queues
•   Bunch of ready standalone products
•   transaction-generated queues
•   standalone systems and the transactional
    integrity problem
RPC/MQ: concept

[Diagram: RPC – synchronous, “point-to-point”: the “client” sends a request to the “server” and waits for the result. MQ – asynchronous, “publisher-subscriber”: the “client” publishes a message to a message queue; consumers (jobs) pick it up later]
Database-driven MQ

[Diagram: “publisher” and “subscriber” both talk to the database]

• transaction integrity
• relatively slow
• mostly used for transaction-based queues
• hundreds of events/sec per shard server is OK
• subscribers: event dispatching (see the sketch below)
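A sketch of a transaction-based MySQL queue (InnoDB); the events table, dispatch_event() and the $mysqli handle are assumptions:

<?php
// events(id BIGINT AUTO_INCREMENT PRIMARY KEY, type VARCHAR(32), payload TEXT,
//        created_at DATETIME) -- assumed schema

// Publisher: write the event in the SAME transaction as the business data,
// so the data change and its event can never get out of sync.
$mysqli->query('BEGIN');
$mysqli->query("UPDATE users SET status = 'deleted' WHERE user_id = 42");
$payload = $mysqli->real_escape_string(json_encode(array('user_id' => 42)));
$mysqli->query("INSERT INTO events (type, payload, created_at)
                VALUES ('user_deleted', '$payload', NOW())");
$mysqli->commit();

// Subscriber: take a batch, dispatch, delete – all inside one transaction.
$mysqli->query('BEGIN');
$res = $mysqli->query('SELECT id, type, payload FROM events ORDER BY id LIMIT 100 FOR UPDATE');
while ($event = $res->fetch_assoc()) {
    dispatch_event($event);                                   // hypothetical handler
    $mysqli->query('DELETE FROM events WHERE id = ' . (int)$event['id']);
}
$mysqli->commit();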
Case#4: MySQL-based queues

 • flipchart!
 • model, event processing, failover,
   scaling
 • decentralized queues
5. Lean production: measuring
Development + support = 100%

[Chart: development time vs. support time, both axes up to 100% – small or just-started projects spend almost all time on development, «dynamical» projects balance both, tired projects spend almost all their time on support]
Monitoring
• server monitoring is useless for strategic analysis

• good monitoring
    • connects “business” and “technical” values
    • visualizes flows between sub-systems
    • helps to optimize flows
    • generally, helps to make right decisions

• user -> (something complex) -> servers -> monitoring

• in a big system you can’t “reconstruct” flows from server
monitoring
“Traditional” monitoring
Lean way
• users make requests, that’s all
• latency (how long request is processed on server)
    • for various apps (scripts)
    • statistics: not just average
    • internal “structure” of a request
        • which sub-systems are used to process the request
        • the impact of these sub-systems on the latency
• requests per second
    • for various sub-systems
Maintenance

• Latency/RPS by server (server group,
  datacenter …)
• Real-time
• CPU usage by apps (scripts)
• What changes with new releases
PINBA
• a PHP extension handles “start” and “finish” for
  every request
• collects script_name, host, time, rusage …
• sends a UDP packet on request shutdown
• from your entire web cluster
• listener/server thread inside MySQL (v. 5.1.0+)
• SQL interface to all the data
PINBA: client data

• request: script_name, host, domain, time,
  rusage, peak memory, output size, timers
• timers: time + “key(tag) – value” pairs
• example (see the timer sketch below):
   – 0.001 sec
   – {group => “db::update”, server => “dbs42”}
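A sketch of a Pinba timer around a database call; pinba_timer_start()/pinba_timer_stop() come with the pinba extension, the tag values mirror the example above, and $db is an assumed handle:

<?php
$timer = pinba_timer_start(array('group' => 'db::update', 'server' => 'dbs42'));
$db->query("UPDATE users SET last_seen = NOW() WHERE user_id = 42");
pinba_timer_stop($timer);
// the timer (~0.001 s, tags {group => "db::update", server => "dbs42"}) is sent
// by UDP together with the request data when the script shuts down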
PINBA: server data
• SQL: “raw” data or reports
• Reports – separate tables, updated real-time
• Base reports (~10): general, by scripts, by host+script
  pairs…
• Tag reports: CREATE TABLE R … (ENGINE=PINBA
  COMMENT='report:foo,bar')
• R: {script_name, foo_value, bar_value, count, time}
• http://pinba.org – many examples
• 2012 – added nginx module for HTTP statuses
Pinba: real-time monitoring

[Screenshots: req/sec and average time graphs, broken down by scripts, virtual hosts and physical servers]
Request time (latency)
WTF?
Now we know scripts, times and periods – we know where to dig
A year passes, the code rots
The law: usage grows until you start refactoring
Slowest requests
Memcached stats
• Traditional stats
  – Req/sec
  – Hit/miss
  – Bytes read/written
• Stats slabs
• Stats items
• Stats cachedump
Memcached: stats
Cachedump (1/4)
17th slab = 128 K

stats cachedump 17
ITEM uin_search_ZHJhZ29uXzIwMDM0QGhvdG1haWwuY29t [65470 b; 1272983719 s]
ITEM uin_search_YW5nZWw1dHJpYW5hZEBob3RtYWlsLmNvbQ== [65529 b; 1272974774 s]
ITEM unreaded_contacts_count_55857620 [83253 b; 1272498369 s]
ITEM antispam_gui_1676698422010-04-17 [83835 b; 1271677328 s]
ITEM antispam_gui_1708317782010-04-15 [123400 b; 1271523593 s]
ITEM psl_24139020 [65501 b; 1271335111 s]
END
Cachedump (2/4)
•   Extract group name from cachedump
•   See size distributions, find anomalies
•   Or, just see some stupid errors
•   Or, make decisions
    – time to switch on compression
    – split objects into parts
• Big objects are evil for memcached
Cachedump (3/4)

• Extract group name from
  cachedump
• See access time distribution
• You can play with lifetime
• T lifetime >> T access time ?
   – Decrease lifetime for this group
Cachedump (4/4)
•   Can be very slow
•   Buggy (at least old versions)
•   Treat results as statistical samples
•   Or increase the crazy static buffer in the
    source code
Auto debug & profiling (1/2)
• How to profile the code?
• Callgrind & co – good, but too much data, 99.99%
   useless
• Reduction of dimension: measure potentially slow parts
   only (IO: disk ops; remote queries – db, memcached,
   C/C++, …)
• Timers in PINBA
• Adding summary: average time, CPU, remote queries by
   group
• Devel: always add this to the end of every page
• Production: can be written to logs
Auto debug & profiling (2/2)
• What happens between sub-systems
• «cost» visualization
• Easy to find non-trivial bugs:
   – No dbq->memq with refresh
   – Many gets instead of a multi-get (or many inserts instead
     of a multi-insert, et cetera)
   – complex inter-server transactions
   – Many connections to one and the same server
     (database, …)
   – cache-set when database is down or error occurred
   – reading from slave what was just written to the master
   – many more…
What’s missing
• Component stats: MySQL, apache, nginx…
• Server monitoring
• Client side stats (DOM_READY, ON_LOAD) –
  very important
• Errors
Spasibo! (Thank you!)
• 6. Questions session
• alexey.rybak@gmail.com
• a.rybak@corp.badoo.com
• Please fill the feedback form: electronic
  (http://alexeyrybak.com/devconf2012.html) or paper
  (available at my desk). Put your email and I'll send you
  this presentation.
• Please give me your feedback, especially critical
