Distributed Database System
A system involving multiple sites connected together via
communication network.
User at any site can access data stored at any site as if the data
were all stored at the user’s own site.
Each site is a database system in its own right: its own local
database, local users, local DBMS, local DC manager.
DBMS
Communication
Network
A typical distributed database system
database
User Communication
manager
3.
Distributed Database System
(cont.)
Two types
Homogeneous- Same copy of DBMS
Heterogeneous- Different DBMSs
A Fundamental Principle
To the user, a distributed system should look exactly
like a non-distributed system.
4.
Advantages
Enables thestructure of database to mirror that of
the enterprise
Data are located near “greatest demand” site
Faster data access
Faster data processing
increased accessibility
Improved communications
Less danger of a single-point failure
Query Optimization
5.
Disadvantages
Complexity ofimplementation, management and
control
Security
Lack of standards
Increased storage requirements
Query Optimization
6.
Distributed Database System
(cont.)
Sample Systems
Prototypes
SDD-1: Computer Corporation of America (CCA),
late 1970s and early 1980s.
R*: IBM, early 1980s
Distributed INGRES: Berkeley, early 1980s.
Commercial Products :
INGRES / STAR: Relational Technology Inc.
SQL*STAR: Oracle Corp.
DB2 version 2 release 2: IBM
SQL server: Microsoft
7.
The Twelve Objectives
1.Local Autonomy
all operations at a given site are controlled by that site,
should not depend on other sites.
local data is locally owned and managed.
Not wholly achievable => sites should be autonomous to the
maximum extend possible.
2. No Reliance on a Central Site
all sites must be treated as equals.
the central site may be bottleneck.
3. Continuous Operation
Reliability
Availability
Never require the system to be shutdown to perform some
function: e.g. add a new site.
8.
The Twelve Objectives(cont.)
4. Location Independence ( Location Transparency )
user should not need to know at which site the data is stored, but
should be able to behave as if the entire database were stored at
their own local site.
a request for some remote data => system should find the data
automatically.
Advantages
<1> Simplify user programs and activities
<e.g.>
<2> allow data to be moved from one site to another at any
time without invalidating any program or activities.
SELECT S#
FROM S
AT SITE A
WHERE SNAME = 'John'
A
B
C
9.
The Twelve Objectives(cont.)
5. Fragmentation Independence ( Fragmentation
Transparency )
Data Fragmentation
a given local object can be divided up into pieces
(fragments) for physical storage purpose.
<e.g.> user perception
New York
fragment
EMP# DEPT# SALARY
E1 DX 45K
E2 DY 40K
E3 DZ 50K
E4 DY 63K
E5 DZ 40K
EMP
EMP# DEPT# SALARY
E1 DX 45K
E3 DZ 50K
E5 DZ 40K
EMP# DEPT# SALARY
E2 DY 40K
E4 DY 63K
London
fragment
physical storage
New York
physical storage
London
An example of
fragmentation.
10.
The Twelve Objectives(cont.)
Data Fragmentation
A fragmentation can be any subrelation derivable via
restriction and projection (with primary key).
Advantage: data can stored at the location where it is most
frequently used.
Fragmentation independence
user should be able to behave as if the relations were not
fragmented at all.
one reasons why relational technology is suitable for DBMS.
user should be presented with a view of data.
=> system must support updates against join and union views.
Advantages
(1) simplify user program and activity.
(2) allow data to be re-fragmented at any time.
11.
The Twelve Objectives(cont.)
6. Replication Independence ( Replication
Transparency )
Data Replication
USER PERCEPTION
EMP# DEPT# SALARY
E1
E2
E3
E4
E5
DX
DY
DZ
DY
DZ
45K
40K
50K
63K
40K
EMP
An example of replication.
EMP# DEPT# SALARY
E1
E3
E5
DX
DZ
DZ
45K
50K
40K
EMP# DEPT# SALARY
E2
E4
DY
DY
40K
63K
EMP# DEPT# SALARY
E2
E4
DY
DY
40K
63K
EMP# DEPT# SALARY
E1
E3
E5
DX
DZ
DZ
45K
50K
40K
New York fragment London fragment
replica of London fragment replica of New York fragment
physical storage
New York
physical storage
London
copy
12.
The Twelve Objectives(cont.)
Data Replication
A given fragment of relation can be represented at the
physical level by many distinct copies of the same object at
many distinct sites.
Unit of replication: fragment (may not a complete relation)
Advantage: better performance and availability
Disadvantage: update propagation problem.
Replication Independence
User should be able to behave as if the data is not replicated
at all.
Advantages
(1) simplify user programs and activities.
(2) allow replicas to be created and destroyed dynamically.
13.
The Twelve Objectives(cont.)
7. Distributed Query Processing
message transfer cost
optimization
8. Distributed Transaction Management
concurrency control
recovery control
9. Hardware Independence: IBM, DEC, HP, PC, ...
10. Operating System Independence: VMS, UNIX, ...
11. Network Independence: BITNET, INTERNET,
ARPANET, ...
12. DBMS Independence: Relational, hierarchical, network, ...
distributed system may be heterogeneous.
14.
Problems of Distributed
DatabaseSystems
Basic point: network are slow !
Overriding Objective : minimize the number and
volume of messages.
Give rise to the following problem
Query Processing
Update Propagation
Concurrency
Recovery
Catalog Management
.
.
.
15.
Query Processing: Example
Query Optimization is more important in a
distributed system.
Example
Database:S ( S#, CITY ) 10,000 tuples, stored at site A.
P ( P#, COLOR) 100,000 tuples, stored at site B.
SP ( S#, P# ) 1,000,000 tuples, stored at site A.
Assume each tuple is 100 bits long.
Site A: S SP Site B: P
16.
Query Processing: Example(cont.)
Query: "Select S# for London suppliers of Red Parts"
Estimates
# of Red Parts = 10
# of Shipments by London Supplier = 100,000
Communication Assumption :
Data Rate = 10,000 bits per second
Access Delay = 1 second
T[i] = total communication time for strategy i
= total access delay + total data volume / data rate
= (# of messages * 1 sec) + (total # of bits / 10,000 ) sec.
SELECT S.S#
FROM S, P, SP
WHERE S.CITY = "London"
AND S.S# = SP.S#
AND SP.P# = P.P#
AND P.COLOR = 'Red'
site A site B
S, SP P
S SP
17.
Query Processing: Example(cont.)
Strategy 1
1. Join S and SP at site A
2. Select tuples from ( S SP ) for which city is 'London'
( 100,000 tuples )
3. For each of those tuple, check site B to see if the part is
red. (2 messages: 1 query, 1 response)
T[1] = ( 100,000 * 2 ) * 1 = 2.3 days
Strategy 2
Move relations S and SP to site B and process the query at B.
T[2] = 2+(10,000+1,000,000)*100/10,000 = 28 hours
Strategy 3
Move relation P to site A and process the query at A
T[3] =1+(100,000*100) /10,000 = 16.7 min
site A site B
S, SP P
18.
Query Processing: Example(cont.)
Strategy 4
1. Select tuples from P where color is red. (10 tuples)
2. Check site A to see if there exists a shipment relating the
part to a London Supplier. ( 2*10 messages )
T[4] = 2*10*1 = 20 sec
Strategy 5
1. Select tuples from P where color is red (10 tuples)
2. Move the result to site A and complete the processing at A.
T[5] = 1 + ( 10*100) / 100,000 = 1.01 sec
Note: Each of the five strategies represents a plausible
solution, but the variation in communication time is
enormous.
site A site B
S, SP P
19.
Update Propagation
Basicproblem with data replication
An update to any given logical data object must be propagated to all
stored copies of that object.
some sites may be unavailable (because of site or network failure)
at the time of the update
=> Data is less available !
A possible Solution: Primary Copy (used in distributed INGRES)
one copy of each object is designated as the primary copy.
primary copies of different objects will generally be at different
sites.
Update Operation
1. Complete as soon as the primary copy has been updated.
2. Control is returned and the transaction can continue execute.
3. The site holding the primary copy broadcasts the update to all other
sites.
Further Problem: violation of the local autonomy objective.
20.
Concurrency
Most distributedsystems are based on locking .
Requests to test, set and release locks are messages overhead.
<e.g.>: Consider a transaction T that needs to update an object for which there
exists replicas at n remote sites.
Several orders of magnitude greater than in a centralized system.
Solution
adopt the primary copy strategy.
the site holding the primary copy of X handles all locking operations involving X.
1 lock request, 1 lock grant, n updates, n ack, and 1 unlock request (2n+3<=5n).
Problem : loss of local autonomy.
n lock requests
n lock grants
n update messages
n acknowledgments
n unlock requests
5n :
21.
Concurrency (cont.)
GlobalDeadlock Problem
Neither site can detect it using only information that is internal to
that site.
i.e. no cycles in two local wait-for-graph, but a cycle in the global.
<e.g.>
Global deadlock detection needs further communication overhead.
holes lock Lx
SITE X
T1x T2x
wait for T1x
to release Lx
holes lock Ly
T1y T2y
wait for T2y
to release Ly
wait for T1y
to complete
wait for T2x
to complete
22.
Recovery
Two-Phase Commit
important whenever a given transaction can interact with multiple,
independent 'resource manager' (site).
the transaction issues a single system-wide COMMIT (or ROLLBACK), which
is handled by a new system component: Coordinator.
The coordinator goes through the following two-phase process :
<1> Coordinator requests all participant to decide whether commit ( reply OK ) or rollback.
<2> If all participants reply OK, the coordinator broadcast 'COMMIT', otherwise broadcast
'ROLLBACK' , and all participants must obey.
Points arising
the coordinator function must be performed by different sites for different
transactions => No reliance on a central site.
the participants must do what is told by the coordinator => loss of local
autonomy.
Coordinator must communicate with participants => more messages and
more overhead.
No protocol guarantee that all participants will commit or rollback in
unison.
23.
Catalog Management
Contentsof catalog:
not only data regarding relations, indexes, users, etc,
but also all the necessary control information for
independence.
Where and How ?
1. Centralized: violate no reliance on a central site.
2. Fully replicated: loss of local autonomy.
3. Partitioned: nonlocal operations are very expensive.
4. Combination of (1) and (3): violate no reliance on a
central site.