Distributed Postgres
XL, XTM, MultiMaster
Stas Kelvich
Started about a year ago.
Konstantin Knizhnik, Constantin Pan, Stas Kelvich
Cluster group in PgPro
Started by playing with Postgres-XC; 2ndQuadrant also had a project (now finished) to port XC to 9.5.
Maintaining a fork is painful;
How can we bring the functionality of XC into core?
Cluster group in PgPro
Distributed transactions: nothing in core;
Distributed planner: FDW, pg_shard, Greenplum planner (?);
HA/autofailover: can be built on top of logical decoding.
Distributed postgres
Achieve proper isolation for multi-node transactions.
Currently, at the start of a write transaction, Postgres will:
Acquire an XID;
Get the list of running transactions;
Use that information in visibility checks (a toy sketch follows below).
Distributed transactions
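To make those three steps concrete, here is a minimal, self-contained C sketch of snapshot-based visibility. It only models the idea (an assigned XID, a list of in-progress XIDs, and a visibility test); the names ToySnapshot and xid_visible are invented for illustration and are not the Postgres code in procarray.c/tqual.c.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;

    /* Toy snapshot: everything at or above xmax started after the snapshot
     * was taken, plus an explicit list of XIDs that were still in progress. */
    typedef struct {
        Xid xmin;
        Xid xmax;
        Xid inprogress[16];
        int ninprogress;
    } ToySnapshot;

    /* Rough analogue of the check done via XidInMVCCSnapshot(): a row created
     * by xid is invisible if xid started after the snapshot was taken or was
     * still running when it was taken. */
    static bool xid_visible(Xid xid, const ToySnapshot *snap)
    {
        if (xid >= snap->xmax)
            return false;
        for (int i = 0; i < snap->ninprogress; i++)
            if (snap->inprogress[i] == xid)
                return false;
        return true;
    }

    int main(void)
    {
        ToySnapshot snap = { .xmin = 90, .xmax = 100,
                             .inprogress = { 95 }, .ninprogress = 1 };
        /* xid 93 committed before the snapshot: visible; xid 95 was running: not. */
        return (xid_visible(93, &snap) && !xid_visible(95, &snap)) ? 0 : 1;
    }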
transam/clog.c:
GetTransactionStatus
SetTransactionStatus
transam/varsup.c:
GetNewTransactionId
ipc/procarray.c:
TransactionIdIsInProgress
GetOldestXmin
GetSnapshotData
time/tqual.c:
XidInMVCCSnapshot
XTM API: vanilla
Same call sites as above (clog.c, varsup.c, procarray.c, tqual.c), now routed through a Transaction Manager indirection layer.
XTM API: after patch
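A self-contained sketch of the indirection this introduces: the call sites above go through a table of function pointers instead of calling fixed functions. The struct name, fields and signatures below are simplified illustrations, not the exact XTM API from the patch.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;

    /* Illustrative function-pointer table; field names and signatures are
     * simplified, not the exact patch API. */
    typedef struct TransactionManager {
        Xid  (*GetNewTransactionId)(void);
        bool (*IsInProgress)(Xid xid);
        /* ... GetTransactionStatus, SetTransactionStatus, GetSnapshotData,
         *     GetOldestXmin, XidInMVCCSnapshot, ... */
    } TransactionManager;

    /* The built-in ("vanilla") behaviour becomes just one implementation. */
    static Xid next_xid = 100;
    static Xid  local_get_new_xid(void)       { return next_xid++; }
    static bool local_is_in_progress(Xid xid) { (void) xid; return false; }

    static TransactionManager DefaultTM = { local_get_new_xid, local_is_in_progress };

    /* Core code now calls through this pointer rather than directly. */
    static TransactionManager *TM = &DefaultTM;

    int main(void)
    {
        Xid xid = TM->GetNewTransactionId();  /* was a direct call before the patch */
        return TM->IsInProgress(xid) ? 1 : 0;
    }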
Same call sites, with the Transaction Manager implementation supplied by a loadable module, pg_dtm.so.
XTM API: after TM load
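Continuing the same sketch: once the indirection exists, a loadable module (pg_dtm.so in the diagram) can install its own table. In real Postgres this would happen when the library is preloaded and its _PG_init() runs; the names below (dtm_get_new_xid, install_dtm) are hypothetical.

    /* Hypothetical distributed implementations, as a pg_dtm.so-style module
     * might provide them (e.g. asking an arbiter or other nodes for XIDs). */
    static Xid  dtm_get_new_xid(void)       { return 1000; /* placeholder */ }
    static bool dtm_is_in_progress(Xid xid) { (void) xid; return false; }

    static TransactionManager DtmTM = { dtm_get_new_xid, dtm_is_in_progress };

    /* On library load (think _PG_init), redirect every core call site at once. */
    void install_dtm(void) { TM = &DtmTM; }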
Acquire XIDs centrally (DTMd, an arbiter daemon);
No purely local transactions are possible;
DTMd is a bottleneck.
XTM implementations
GTM or snapshot sharing
Based on a paper from the SAP HANA team;
A central daemon is still needed, but only for multi-node transactions;
Snapshots are replaced by a Commit Sequence Number (CSN);
DTMd is still a bottleneck.
XTM implementations
Incremental SI
XIDs/CSNs are gathered from all nodes that participate in the transaction;
No central service;
Local transactions remain possible;
Communication can be reduced by using physical time, as in Spanner and CockroachDB (sketched below).
XTM implementations
Clock-SI or tsDTM
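A minimal sketch of the timestamp-based idea (in the spirit of Clock-SI/tsDTM, not the actual pg_tsdtm code): every participant proposes a prepare timestamp from its local clock, the coordinator takes the maximum as the commit CSN, and a snapshot taken at some CSN sees only transactions that committed at or before it. Function names here are invented for illustration.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t Csn;   /* commit sequence number derived from local clocks */

    /* Coordinator side: the commit CSN is the maximum of the prepare
     * timestamps proposed by every node participating in the transaction. */
    static Csn choose_commit_csn(const Csn *prepare_csn, int nparticipants)
    {
        Csn commit = 0;
        for (int i = 0; i < nparticipants; i++)
            if (prepare_csn[i] > commit)
                commit = prepare_csn[i];
        return commit;
    }

    /* Visibility: a committed tuple is visible to a snapshot taken at
     * snapshot_csn only if it committed at or before that point. */
    static bool csn_visible(Csn commit_csn, Csn snapshot_csn)
    {
        return commit_csn <= snapshot_csn;
    }

    int main(void)
    {
        Csn proposals[3] = { 1042, 1040, 1047 };      /* local clocks of 3 nodes */
        Csn commit = choose_commit_csn(proposals, 3); /* picks 1047 */
        printf("commit csn = %llu, visible at 1050: %d\n",
               (unsigned long long) commit, csn_visible(commit, 1050));
        return 0;
    }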
XTM implementations
tsDTM scalability
More nodes means a higher probability that something in the system fails.
Possible problems with nodes:
A node stops and will not come back;
A node is down for a short time (and we should bring it back into operation);
Network partitions (avoid split-brain).
If we want to survive network partitions, then we can tolerate no more than ⌈N/2⌉ - 1 failures, since a majority of nodes must stay connected; e.g. with N = 5 a majority is 3, so at most 2 nodes may fail.
HA/autofailover
Possible uses of such a system:
Multimaster replication;
Tables with meta-information in sharded databases;
Sharding with redundancy.
HA/autofailover
By multimaster we mean a strongly coupled setup that acts as a single database, with proper isolation and no merge conflicts.
Ways to build it:
Impose a global order on the XLOG (Postgres-R, MySQL Galera);
Wrap each transaction as a distributed one; this allows parallelism while applying transactions.
Multimaster
Our implementation:
Built on top of pg_logical;
Makes use of tsDTM;
Pool of workers for transaction replay;
Raft-based storage for handling failures and for distributed deadlock detection.
Multimaster
Our implementation:
Approximately half the speed of standalone Postgres;
The same speed for reads;
Handles node autorecovery;
Handles network partitions (being debugged right now);
Can work as an extension (if the community accepts the XTM API into core).
Multimaster

Distributed Postgres