Distributed Postgres
XL, XTM, MultiMaster
Stas Kelvich
Started about a year ago.
Konstantin Knizhnik, Constantin Pan, Stas Kelvich
Cluster group in PgPro
Started by playing with Postgres-XC; 2ndQuadrant also had a project (now finished) to port XC to 9.5.
Maintaining a fork is painful;
How can we bring the functionality of XC into core?
Cluster group in PgPro
Distributed transactions: nothing in core;
Distributed planner: FDW, pg_shard, Greenplum planner (?);
HA/autofailover: can be built on top of logical decoding.
Distributed postgres
Achieve proper isolation for multi-node transactions.
Currently, at the start of a write transaction, Postgres will:
Acquire an XID;
Get the list of running transactions;
Use that information in visibility checks (a toy sketch follows below).
Distributed transactions
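To make those three steps concrete, here is a minimal, self-contained C sketch of snapshot-based visibility. It only models the idea (an assigned XID, a list of in-progress XIDs, and a visibility test); the names ToySnapshot and xid_visible are invented for illustration and are not the Postgres code in procarray.c/tqual.c.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;

    /* Toy snapshot: everything at or above xmax started after the snapshot
     * was taken, plus an explicit list of XIDs that were still in progress. */
    typedef struct {
        Xid xmin;
        Xid xmax;
        Xid inprogress[16];
        int ninprogress;
    } ToySnapshot;

    /* Rough analogue of the check done via XidInMVCCSnapshot(): a row created
     * by xid is invisible if xid started after the snapshot was taken or was
     * still running when it was taken. */
    static bool xid_visible(Xid xid, const ToySnapshot *snap)
    {
        if (xid >= snap->xmax)
            return false;
        for (int i = 0; i < snap->ninprogress; i++)
            if (snap->inprogress[i] == xid)
                return false;
        return true;
    }

    int main(void)
    {
        ToySnapshot snap = { .xmin = 90, .xmax = 100,
                             .inprogress = { 95 }, .ninprogress = 1 };
        /* xid 93 committed before the snapshot: visible; xid 95 was running: not. */
        return (xid_visible(93, &snap) && !xid_visible(95, &snap)) ? 0 : 1;
    }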
transam/clog.c:
GetTransactionStatus
SetTransactionStatus
transam/varsup.c:
GetNewTransactionId
ipc/procarray.c:
TransactionIdIsInProgress
GetOldestXmin
GetSnapshotData
time/tqual.c:
XidInMVCCSnapshot
XTM API: vanilla
Same call sites as above (clog.c, varsup.c, procarray.c, tqual.c), now routed through a Transaction Manager indirection layer.
XTM API: after patch
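A self-contained sketch of the indirection this introduces: the call sites above go through a table of function pointers instead of calling fixed functions. The struct name, fields and signatures below are simplified illustrations, not the exact XTM API from the patch.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;

    /* Illustrative function-pointer table; field names and signatures are
     * simplified, not the exact patch API. */
    typedef struct TransactionManager {
        Xid  (*GetNewTransactionId)(void);
        bool (*IsInProgress)(Xid xid);
        /* ... GetTransactionStatus, SetTransactionStatus, GetSnapshotData,
         *     GetOldestXmin, XidInMVCCSnapshot, ... */
    } TransactionManager;

    /* The built-in ("vanilla") behaviour becomes just one implementation. */
    static Xid next_xid = 100;
    static Xid  local_get_new_xid(void)       { return next_xid++; }
    static bool local_is_in_progress(Xid xid) { (void) xid; return false; }

    static TransactionManager DefaultTM = { local_get_new_xid, local_is_in_progress };

    /* Core code now calls through this pointer rather than directly. */
    static TransactionManager *TM = &DefaultTM;

    int main(void)
    {
        Xid xid = TM->GetNewTransactionId();  /* was a direct call before the patch */
        return TM->IsInProgress(xid) ? 1 : 0;
    }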
Same call sites, with the Transaction Manager implementation supplied by a loadable module, pg_dtm.so.
XTM API: after TM load
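Continuing the same sketch: once the indirection exists, a loadable module (pg_dtm.so in the diagram) can install its own table. In real Postgres this would happen when the library is preloaded and its _PG_init() runs; the names below (dtm_get_new_xid, install_dtm) are hypothetical.

    /* Hypothetical distributed implementations, as a pg_dtm.so-style module
     * might provide them (e.g. asking an arbiter or other nodes for XIDs). */
    static Xid  dtm_get_new_xid(void)       { return 1000; /* placeholder */ }
    static bool dtm_is_in_progress(Xid xid) { (void) xid; return false; }

    static TransactionManager DtmTM = { dtm_get_new_xid, dtm_is_in_progress };

    /* On library load (think _PG_init), redirect every core call site at once. */
    void install_dtm(void) { TM = &DtmTM; }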
Acquire XIDs centrally (DTMd, an arbiter daemon);
No purely local transactions are possible;
DTMd is a bottleneck.
XTM implementations
GTM or snapshot sharing
Based on a paper from the SAP HANA team;
A central daemon is still needed, but only for multi-node transactions;
Snapshots are replaced by a Commit Sequence Number (CSN);
DTMd is still a bottleneck.
XTM implementations
Incremental SI
XIDs/CSNs are gathered from all nodes that participate in the transaction;
No central service;
Local transactions remain possible;
Communication can be reduced by using physical time, as in Spanner and CockroachDB (sketched below).
XTM implementations
Clock-SI or tsDTM
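A minimal sketch of the timestamp-based idea (in the spirit of Clock-SI/tsDTM, not the actual pg_tsdtm code): every participant proposes a prepare timestamp from its local clock, the coordinator takes the maximum as the commit CSN, and a snapshot taken at some CSN sees only transactions that committed at or before it. Function names here are invented for illustration.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t Csn;   /* commit sequence number derived from local clocks */

    /* Coordinator side: the commit CSN is the maximum of the prepare
     * timestamps proposed by every node participating in the transaction. */
    static Csn choose_commit_csn(const Csn *prepare_csn, int nparticipants)
    {
        Csn commit = 0;
        for (int i = 0; i < nparticipants; i++)
            if (prepare_csn[i] > commit)
                commit = prepare_csn[i];
        return commit;
    }

    /* Visibility: a committed tuple is visible to a snapshot taken at
     * snapshot_csn only if it committed at or before that point. */
    static bool csn_visible(Csn commit_csn, Csn snapshot_csn)
    {
        return commit_csn <= snapshot_csn;
    }

    int main(void)
    {
        Csn proposals[3] = { 1042, 1040, 1047 };      /* local clocks of 3 nodes */
        Csn commit = choose_commit_csn(proposals, 3); /* picks 1047 */
        printf("commit csn = %llu, visible at 1050: %d\n",
               (unsigned long long) commit, csn_visible(commit, 1050));
        return 0;
    }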
XTM implementations
tsDTM scalability
More nodes means a higher probability that something in the system fails.
Possible problems with nodes:
A node stops and will not come back;
A node is down for a short time (and we should bring it back into operation);
Network partitions (avoid split-brain).
If we want to survive network partitions, then we can tolerate no more than ⌈N/2⌉ - 1 failures, since a majority of nodes must stay connected; e.g. with N = 5 a majority is 3, so at most 2 nodes may fail.
HA/autofailover
Possible uses of such a system:
Multimaster replication;
Tables with meta-information in sharded databases;
Sharding with redundancy.
HA/autofailover
By multimaster we mean a strongly coupled setup that acts as a single database, with proper isolation and no merge conflicts.
Ways to build it:
Impose a global order on the XLOG (Postgres-R, MySQL Galera);
Wrap each transaction as a distributed one; this allows parallelism while applying transactions.
Multimaster
Our implementation:
Built on top of pg_logical;
Makes use of tsDTM;
Pool of workers for transaction replay;
Raft-based storage for handling failures and for distributed deadlock detection.
Multimaster
Our implementation:
Approximately half the speed of standalone Postgres;
The same speed for reads;
Handles node autorecovery;
Handles network partitions (being debugged right now);
Can work as an extension (if the community accepts the XTM API into core).
Multimaster

Distributed Postgres