Towards a Scalable
Non-Blocking Coding Style
Dr. Cliff Click, Distinguished Engineer
Azul Systems
http://blogs.azulsystems.com/cliff
The Computer Revolution is Here
• We already did the 0->1 CPU transition
Concurrent Programming is Now 'The Norm', and hard to do
• We're doing the 1->2 CPU transition
Scalable Concurrent Programming is even harder
• Time to think about the 2->N CPU transition
Here is a different way of thinking about the problem

2008 JavaOneSM Conference | java.sun.com/javaone |

What is a Non-Blocking Algorithm?
Formally:
• Stopping one thread will not prevent global progress
Less formally:
• No thread 'locks' any resource and then gets pre-empted by the OS, or blocked in I/O, etc
• No 'critical sections', locks, mutexes, spin-locks, etc

Individual threads might starve

XXX-Free Hierarchy
Wait-Free Algorithms (the best)
• All threads complete in finite count of steps
• Low priority threads cannot block high priority threads
• No priority inversion possible
Lock-Free (this work)
• Every successful step makes Global Progress
• But individual threads may starve
• Hence priority inversion is possible
• No live-lock

Obstruction-Free
• A single thread in isolation completes in finite count of steps
• Threads may block each other
• Hence live-lock is possible
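
As a minimal sketch of the lock-free level of the hierarchy (the class name `LockFreeCounter` is a hypothetical illustration, not code from the talk), a CAS retry loop has exactly the property listed above: a failed CAS implies that some other thread's CAS succeeded, so there is global progress even though this particular thread may have to retry.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; the class name is not from the talk.
final class LockFreeCounter {
    private final AtomicLong value = new AtomicLong();

    // Lock-free: a failed CAS means another thread's CAS succeeded,
    // so the system as a whole progressed even if this thread retries.
    long increment() {
        while (true) {
            long old = value.get();
            if (value.compareAndSet(old, old + 1)) {
                return old + 1;
            }
            // Lost the race: re-read and retry (this thread may starve).
        }
    }

    long get() { return value.get(); }

    public static void main(String[] args) throws InterruptedException {
        LockFreeCounter c = new LockFreeCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < 10_000; j++) c.increment(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get()); // total of 4 threads x 10,000 increments
    }
}
```

No thread ever holds a resource here, so pre-empting any one thread cannot stop the others.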

Motivation
Multi-core is now almost unavoidable
Larger core counts more common:
• 8+ (X86), 64 (Sun/Rock), 768 (Azul, more coming)
Locking suffers serious contention issues
• Amdahl's Law, etc
Would like to write correct code without locks!
Obstruction-free can live-lock
• More prone with higher cpu count
• Or higher thread count
Wait-free algorithms behave the best
• But tend to be slow
• And are very hard to code
• Handful of people on the planet can write these

Scalable
Most large-CPU-count shared-memory hardware is:
• Parallel-read, Independent-write
Multiple CPUs reading the same location is fast
• Free 'cache-hitting-loads'
Multiple CPUs writing to the same location serialize
• Speed limited to '1-cache-miss-per-write' or '1-memory-bus-update-per-write'

Must avoid all CPUs writing the same location for independent operations
• i.e., no shared counters, single lock-words, etc
Classic reader/writer lock chokes with >100 CPUs
• Contention on single reader-count word limits scaling
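
One way to avoid a single hot word can be sketched as a striped counter. This is an assumption-laden illustration (the class `StripedCounter` and its stripe selection by thread id are hypothetical, not from the talk): independent increments are spread over many words, and only the reader pays by summing them.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: spread updates over many words so that
// independent increments rarely write the same location.
final class StripedCounter {
    private final AtomicLongArray cells;
    private final int mask;

    StripedCounter(int stripesPowerOf2) {
        if (Integer.bitCount(stripesPowerOf2) != 1)
            throw new IllegalArgumentException("stripes must be a power of 2");
        cells = new AtomicLongArray(stripesPowerOf2);
        mask = stripesPowerOf2 - 1;
    }

    void increment() {
        // Assumption: the thread id is a good-enough stripe selector.
        int idx = (int) Thread.currentThread().getId() & mask;
        cells.getAndIncrement(idx);
    }

    // The sum is only a snapshot while writers are active.
    long sum() {
        long s = 0;
        for (int i = 0; i < cells.length(); i++) s += cells.get(i);
        return s;
    }

    public static void main(String[] args) {
        StripedCounter c = new StripedCounter(8);
        for (int i = 0; i < 1000; i++) c.increment();
        System.out.println(c.sum());
    }
}
```

Adjacent array elements can still share a cache line, so a production version would likely pad the stripes; the JDK's `java.util.concurrent.atomic.LongAdder` is built on a similar striping idea.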

Agenda
Motivation
A Scalable Non-Blocking Coding Style
Example 1: BitVector
Example 2: HashTable
Example 3: Nearly FIFO Queue
Summary

Parts we need...
An Array to hold all Data
• Fast parallel (scalable) access
Atomic-update on single Array Words
• java.util.concurrent.atomic.Atomic*
• “No spurious failure CAS”
A Finite State Machine
• Replicated per array word (or small set of words)
• Use Atomic-Update to 'step' in the FSM
Construct algorithm from many FSM 'steps'
• Lock-Free: Each CAS makes progress
• CAS success is local progress
• CAS failure means another CAS succeeded
(global progress, local starvation)
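
The "FSM per array word, stepped by CAS" idea can be sketched as follows. The three states and the class `SlotStates` are hypothetical stand-ins chosen only to show the mechanism; the talk's real state machines appear in the examples.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical three-state machine, replicated per array word.
final class SlotStates {
    static final int EMPTY = 0, CLAIMED = 1, DONE = 2;
    private final AtomicIntegerArray words;

    SlotStates(int n) { words = new AtomicIntegerArray(n); } // all words start EMPTY

    // One FSM step: succeeds only if the word is still in state 'from'.
    // A false return means another thread already moved this word.
    boolean step(int idx, int from, int to) {
        return words.compareAndSet(idx, from, to);
    }

    int state(int idx) { return words.get(idx); }

    public static void main(String[] args) {
        SlotStates s = new SlotStates(4);
        System.out.println(s.step(0, EMPTY, CLAIMED)); // first claim of word 0
        System.out.println(s.step(0, EMPTY, CLAIMED)); // word 0 is no longer EMPTY
    }
}
```

Each word steps independently, so threads working on different words never write the same location.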

How Big is the Array?
Don't answer that: Make array growable
• Resize array as needed
• Common operation for Collection classes
Support array resize via State Machine
• Really: array-copy while in use
• All array words are independent
• Copy is parallel, incremental, concurrent
But mostly operate without a copy-in-progress
• So the common situation is simple, fast

Concurrent Array Resize
Copy old Array into a new larger Array
The hard part during a resize operation:
• Copy without losing any late-writes to old Array
Fix: “mark” old Array words with no-more-updates flag
• Payload still visible through the “mark”
Updaters of marked payload must copy, then update in the new array
Readers seeing the mark must copy, then read in the new array
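
The marking step alone can be sketched as below, assuming (as in the BitVector example) that the sign bit of a `long` is the no-more-updates flag. The class `MarkedWord` and its method names are hypothetical, and the copy into the new array is deliberately left out.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: sign bit = no-more-updates mark, low 63 bits = payload.
final class MarkedWord {
    static final long MARK = Long.MIN_VALUE;

    static boolean isMarked(long w) { return w < 0; }
    static long payload(long w) { return w & ~MARK; } // readable through the mark

    // Freeze words[idx]; returns the payload frozen under the mark.
    static long mark(AtomicLongArray words, int idx) {
        while (true) {
            long w = words.get(idx);
            if (isMarked(w)) return payload(w);        // someone else marked it
            if (words.compareAndSet(idx, w, w | MARK)) return w;
            // CAS failed: a late write landed first; re-read so it is not lost.
        }
    }

    // Normal update: OR payload bits in. Returns false if the word is marked,
    // in which case the caller must copy and then update in the new array.
    static boolean tryOr(AtomicLongArray words, int idx, long bits) {
        while (true) {
            long w = words.get(idx);
            if (isMarked(w)) return false;
            if (words.compareAndSet(idx, w, w | bits)) return true;
        }
    }

    public static void main(String[] args) {
        AtomicLongArray a = new AtomicLongArray(1);
        tryOr(a, 0, 0b101L);
        System.out.println(mark(a, 0));          // frozen payload
        System.out.println(tryOr(a, 0, 0b10L));  // refused after the mark
    }
}
```

Because the mark is set with the same CAS discipline as ordinary updates, a late writer either lands before the mark (and is captured in the frozen payload) or sees the mark and is redirected.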

Atomic Update
Need some form of Atomic-Update
• java.util.concurrent.atomic.Atomic*
Update 1 word IFF old-value is equal to expected-value
Generally Compare-And-Swap (CAS, Azul/Sparc/X86) or
Load-Linked / Store-Conditional (LL/SC, IBM)
Common Hardware Limitations
• LL/SC suffers from live-lock
• Both CAS & LL/SC can suffer spurious failure on some hardware
• Infinite spurious failures is live-lock(?)
• Finite failures fixed with spin loop
• Useful if CAS does not spuriously fail (e.g. Azul)
• Especially at high CPU count
• If 1000 CPUs attempt update, 1 should succeed

Atomic Update: Failure
CAS failure returns old value on most (all?) hardware?
• Old value is evidence CAS did not fail spuriously
• The “witness” - the “proof of failure”
• LL/SC never provides old value
The witness is not available after the CAS
• Overwritten by another thread
JDK API mistake: witness turned into a boolean
• Hence failure-for-cause cannot be distinguished from spurious failure

Hence must spin on CAS failure until we see the reason for failure
• Report either CAS success OR
• CAS failure-for-cause
Spinning builds a “No spurious failure CAS”
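
A minimal sketch of that spin, assuming an `AtomicLongArray` as the data array (the helper name `casWitness` is hypothetical): it turns the boolean-returning JDK CAS into one that reports either success or a witness value that explains the failure.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical helper: a CAS that reports success or a witness.
final class WitnessCas {
    // Returns the value seen in words[idx]. A result equal to 'expected'
    // means the CAS succeeded; anything else is the witness for the failure.
    static long casWitness(AtomicLongArray words, int idx, long expected, long update) {
        while (true) {
            long seen = words.get(idx);
            if (seen != expected) return seen;                  // failure-for-cause
            if (words.compareAndSet(idx, expected, update)) return expected; // success
            // 'false' with no cause in hand: re-read and spin until a cause is seen.
        }
    }

    public static void main(String[] args) {
        AtomicLongArray a = new AtomicLongArray(1);
        System.out.println(casWitness(a, 0, 0L, 5L)); // word held the expected value
        System.out.println(casWitness(a, 0, 0L, 7L)); // word no longer holds 0
    }
}
```

On Java 9 and later, `AtomicLongArray.compareAndExchange` returns the witness value directly, which removes the need for this wrapper.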

Towards A Scalable Lock-Free Coding Style
Big Array to hold Data
Parallel, Scalable read access
Concurrent writes via: CAS & Finite State Machine
• No JMM issues during Finite State Machine updates
• No locks, no volatile
Fast as a best-of-breed not-thread-safe implementation
• But as correct as thread-safe implementations
• Much faster than locking under heavy load
• No indirections in common case
• Directly reach main data array in 1 step
Resize as needed
• Copy Array to a larger Array on demand
• Use State Machine to help copy
• “Mark” old Array words to avoid missing late updates

Example 1: BitVector
Size: O(max element)
• Auto-resizing
Supports concurrent insert, remove, test&set
Obvious implementation:
• Array of 'long' - 64-bit payload words
• Bit mask & shift accessors
How to 'mark' payload?
• Steal 1 bit out of 64
• MOD 63 to select index words – this example only
• (Actually: avoid slow MOD by moving every 64th
bit to recursive bitvector)

Code up in SourceForge, high-scale-lib

Example 1: BitVector
Basic get & test/set (using MOD)
• Read Array once – it may change out from under us!
• Out-of-bounds triggers resize
• 'Mark' triggers copy & retry
• Failed CAS must retry – BUT!
  • Means another thread made progress

boolean get( int x ) {
  long[] A = _A;                // read once
  int idx = x/63;
  if( idx >= A.length )
    return false;               // out of bounds: bit was never set
  long old = A[idx];
  if( old < 0 )                 // marked?
    return copy(x).get(x);      // copy, then retry in new array
  long mask = 1L << (x%63);
  return (old & mask) != 0;
}

boolean test_set( int x ) {
  long[] A = _A;                // read once
  int idx = x/63;
  if( idx >= A.length )
    return grow(x);             // out of bounds triggers resize
  while( true ) {               // spin loop
    long old = A[idx];
    if( old < 0 )               // marked?
      return copy(x).test_set(x);
    long mask = 1L << (x%63);
    if( (old & mask) != 0 )
      return true;              // bit already set
    if( CAS(A[idx],old,old|mask) )
      return false;             // this thread set the bit
    // CAS failed: another thread made progress, retry
  }
}

Example 1: BitVector
Almost as fast as plain BitVector
• Normal load & mask for get/set
• Range check
• Extra '<0' test (triggers copy & retry)
• Set uses CAS spin-loop
Copy: Sign-bit to stop further updates
• Use CAS to set sign-bit
• Then copy word to new array
• Repeat operation on new array
Finite State Machine!
• per Array word
• Hidden in the code
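
A runnable approximation of the fast path, under loud assumptions: the array is fixed-size (no grow, no copy), so words are never marked, and the class `FixedBitVector` is a hypothetical stand-in rather than the high-scale-lib code. It keeps the same shape: range check, the extra '<0' test, and a CAS spin-loop for set.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical fixed-size variant: 63 payload bits per word, sign bit reserved as the mark.
final class FixedBitVector {
    private final AtomicLongArray words;

    FixedBitVector(int maxBits) { words = new AtomicLongArray(maxBits / 63 + 1); }

    boolean get(int x) {
        int idx = x / 63;
        if (idx >= words.length()) return false;       // range check
        long old = words.get(idx);
        if (old < 0) throw new IllegalStateException("marked: resize not modelled");
        return (old & (1L << (x % 63))) != 0;
    }

    // Returns the previous value of bit x and leaves the bit set.
    boolean testSet(int x) {
        int idx = x / 63;
        if (idx >= words.length())
            throw new IndexOutOfBoundsException("grow not modelled");
        long mask = 1L << (x % 63);
        while (true) {                                  // CAS spin-loop
            long old = words.get(idx);
            if (old < 0) throw new IllegalStateException("marked: resize not modelled");
            if ((old & mask) != 0) return true;         // already set
            if (words.compareAndSet(idx, old, old | mask)) return false;
            // CAS failed: another thread changed this word; retry.
        }
    }

    public static void main(String[] args) {
        FixedBitVector bv = new FixedBitVector(128);
        System.out.println(bv.testSet(5)); // bit 5 was clear before this call
        System.out.println(bv.testSet(5)); // bit 5 was set by the previous call
    }
}
```

Compared with a plain, non-thread-safe bit vector, the only additions on the read path are the sign test and the volatile-style load inside `AtomicLongArray`.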
Let's make the FSM obvious...

BitVector State Machine
States of one array word (mark bit first, then payload bits):
• 0000 “initial”: no bits set
• 0XXX “active”: unmarked, payload in use
• 1XXX “marked”: payload frozen, still readable
• 1000 “copy-done”: payload now lives in the new array
Transitions:
• A: Normal operations. set moves 0000 to 0XXX; set & clear stay in 0XXX
• Out-of-Bounds set triggers resize! A new array is created with every word 0000 “initial”
• B: Mark to prevent further updates. 0XXX becomes 1XXX in the old array (an empty 0000 word is marked straight to 1000)
• C: Copy from old to new. The new array word goes from 0000 to 0XXX
• D: Memory-fence between arrays
• E: Signal copy-done in old table. 1XXX becomes 1000

Resize - motivation
Triggered by adding larger element
Copy each word before get/put
Pay indirection even after copy
• Visit old table, fence, operate on new table
So need to copy all words eventually, and then
Promote: make new array the top-level array
• No more indirection
Policy? How to copy all words?
• Visiting threads can “copy some words”
• Or background threads copy, or only-writers, etc
• Good standard engineering, nothing special

Resize – Copy Mechanics
Helper: any thread copying words it does not directly need
Helpers CAS-up a “promise to copy” counter
• Atomic-increment by fixed N (e.g. 16 words)
Helpers copy words via State Machine
Helpers atomic-increment “done work” counter
• On transition to “copy-done” state
Promote new Array when “done work” == A.length
What If: Helper stalled? (promises but never copies)
• Allow helpers to “double-promise”!
• Worst case: each thread can complete entire copy
Eventually, copy completes & array promotes
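
The two counters can be sketched as follows. Everything here is a hypothetical simplification: the class `CopyHelper` is not from the talk, the per-word copy is a plain load/store standing in for the mark/copy/copy-done state machine, and the “double-promise” recovery for stalled helpers is not modelled.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of the "promise to copy" and "done work" counters.
final class CopyHelper {
    static final int CHUNK = 16;                          // words promised per claim
    final AtomicLongArray oldArr, newArr;
    final AtomicInteger promised = new AtomicInteger();   // "promise to copy"
    final AtomicInteger done = new AtomicInteger();       // "done work"

    CopyHelper(AtomicLongArray oldArr, AtomicLongArray newArr) {
        this.oldArr = oldArr;
        this.newArr = newArr;
    }

    // Claim and copy one chunk; returns true once every word is copied,
    // which is the point at which the new array could be promoted.
    boolean helpCopy() {
        int len = oldArr.length();
        if (promised.get() < len) {
            int start = promised.getAndAdd(CHUNK);        // disjoint range per helper
            int end = Math.min(start + CHUNK, len);
            for (int i = start; i < end; i++) {
                newArr.set(i, oldArr.get(i));             // stand-in for the word FSM
                done.incrementAndGet();
            }
        }
        return done.get() == len;
    }

    public static void main(String[] args) {
        AtomicLongArray o = new AtomicLongArray(40), n = new AtomicLongArray(40);
        CopyHelper h = new CopyHelper(o, n);
        int calls = 1;
        while (!h.helpCopy()) calls++;
        System.out.println(calls + " helper calls to finish the copy");
    }
}
```

Because `getAndAdd` hands each helper a disjoint range, no word is counted twice in the “done work” total.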

Coding Style Elements
Large array for parallel read & update
• No JMM issues for read or update (no lock, no volatile)
State Machine per-array-word
• Successful CAS is FSM transition
• Failed CAS causes retry
• (but another thread made progress)

'Mark' payload words to stop 'late updates'
Array copy for Resize
• Copy is parallel, incremental, concurrent
• Copy part of State Machine
• Unrelated threads can make progress during resize
• Fence between old and new tables

Example 2: HashTable
Array of K/V Pairs
• Keys in even slots, Values odd slots
• CAS each word separately, but FSM spans both words
• Value can also be 'Tombstone'
• Key & Value both start as null
Mark payload by 'boxing' values
Copy on resize, or to flush stale keys
Supports concurrent insert, remove, test, resize
Linear scaling on Azul to 768 CPUs
• More than a billion reads/sec, simultaneous with
• More than 10 million updates/sec
Code up in SourceForge, high-scale-lib
• Passes Java Compatibility Kit (JCK) for ConcurrentHashMap
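
The even/odd K/V layout with one CAS per word can be sketched as below. This is a heavily reduced, hypothetical illustration (`TinyKVTable` is not the high-scale-lib class): fixed size, no resize, no boxing, no tombstones, and non-null values only. It shows just the key-claim step followed by the value-set step.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical fixed-size sketch: key at slot 2*i, value at slot 2*i+1.
final class TinyKVTable {
    private final AtomicReferenceArray<Object> kvs;
    private final int mask;

    TinyKVTable(int pairsPowerOf2) {
        if (Integer.bitCount(pairsPowerOf2) != 1)
            throw new IllegalArgumentException("size must be a power of 2");
        kvs = new AtomicReferenceArray<>(2 * pairsPowerOf2);
        mask = pairsPowerOf2 - 1;
    }

    // Returns the previous value, or null if the key had no value yet.
    Object put(Object key, Object val) {
        int i = key.hashCode() & mask;
        for (int probes = 0; probes <= mask; probes++, i = (i + 1) & mask) {
            Object k = kvs.get(2 * i);
            if (k == null && kvs.compareAndSet(2 * i, null, key)) k = key; // claim key slot
            else if (k == null) k = kvs.get(2 * i);   // lost the race: see who won
            if (!key.equals(k)) continue;             // other key owns it: stride-1 reprobe
            while (true) {                            // set or change the value
                Object old = kvs.get(2 * i + 1);
                if (kvs.compareAndSet(2 * i + 1, old, val)) return old;
            }
        }
        throw new IllegalStateException("table full: resize not modelled");
    }

    Object get(Object key) {
        int i = key.hashCode() & mask;
        for (int probes = 0; probes <= mask; probes++, i = (i + 1) & mask) {
            Object k = kvs.get(2 * i);
            if (k == null) return null;               // unclaimed slot: key absent
            if (key.equals(k)) return kvs.get(2 * i + 1); // null while key is bare
        }
        return null;
    }

    public static void main(String[] args) {
        TinyKVTable t = new TinyKVTable(8);
        t.put("a", 1);
        System.out.println(t.get("a"));
    }
}
```

Key slots are never cleared in this sketch, which is what lets a reader stop probing at the first null key.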

“Uninteresting” Details
Good, standard engineering – nothing special
Closed Power-of-2 Hash Table
• Reprobe on collision
• Stride-1 reprobe: better cache behavior
• (complicated argument about 2^n vs prime goes here)
Key & Value on same cache line
Hash memoized
• Should be same cache line as K + V
• But hard to do in pure Java
No allocation on get() or put()
Auto-Resize

HashTable State Machine
States of one Key/Value slot pair:
• 0/0 “initial”: Key & Value both null
• K/0 “bare key”: key slot claimed, no Value yet
• K/V “active”
• K/T “deleted”: Value is the 'tombstone'; Key remains
• K/[V] “boxed V”: boxing V prevents further changes
• K/[T] “copy done”: final state, “new Array has Value”
Inserting a K/V pair:
• Already probed table, missed
• Found proper empty K/V slot
• Ready to claim slot for this Key
Transitions in normal operation:
• insert key: 0/0 to K/0. Claim key slot
• insert V: K/0 to K/V. Initial set of Value
• change V: K/V to K/V. Change Value uses same key slot
• delete: K/V to K/T. Delete uses 'tombstone' value
• re-insert: K/T to K/V. Re-insert uses same key slot
Transitions during resize (resize triggered, new array created with every slot 0/0 “initial”):
• box: K/V to K/[V] in old array
• insert key: 0/0 to K/0 in new array. Claim key slot in new table
• copy: K/0 to K/V in new array. Copy in new table without box
• Memory-fence between arrays: fence after writing to new array and before setting 'copy done'
• copy done: K/[V] to K/[T] in old array
• Nothing to copy: K/T goes straight to K/[T]
• Copy stops partial insertion: K/0 goes straight to K/[T]

Example 3: Nearly FIFO Queue
Concurrent near-FIFO Queue
• e.g. producer / consumer worklist
• Producers & consumers are large thread pools
Scaling bottleneck:
• Locking or single word CAS on push & pop
Could stripe Queue:
• Many short Queues
• Select random Queue
• Many different locks or many different words to CAS
• Less contention
• Pick at random to push or pop
• Must search all queues for not-full or not-empty

Example 3: Nearly FIFO Queue
1000's of CPUs need 1000's of Queues
• Stripe Ad-Absurdum
• Queues get ever-smaller
• Get down to Queues of 1 entry
Single-entry Queue: either full or empty
• Implement as a single word
• Either null or value
Need 1000's of single-entry Queues
• Array of single word Queues
Producers start @ random index
• Search for null, CAS down value
Consumers start @ random index
• Search for value, CAS down null
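
Since the talk states there was no code yet for this example, the following is only a speculative sketch of the array-of-single-entry-queues idea. The class `SlotQueue` is hypothetical, values must be non-null, and the resize/promote logic and the advancing scan point are not modelled.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical sketch: each array word is a one-entry queue (null = empty).
final class SlotQueue<T> {
    private final AtomicReferenceArray<T> slots;

    SlotQueue(int n) { slots = new AtomicReferenceArray<>(n); }

    // Producer: start at a random index, search for null, CAS the value in.
    boolean offer(T value) {
        int n = slots.length();
        int start = ThreadLocalRandom.current().nextInt(n);
        for (int k = 0; k < n; k++) {
            int i = (start + k) % n;
            if (slots.get(i) == null && slots.compareAndSet(i, null, value)) return true;
        }
        return false; // every slot looked full on this pass
    }

    // Consumer: start at a random index, search for a value, CAS it to null.
    T poll() {
        int n = slots.length();
        int start = ThreadLocalRandom.current().nextInt(n);
        for (int k = 0; k < n; k++) {
            int i = (start + k) % n;
            T v = slots.get(i);
            if (v != null && slots.compareAndSet(i, v, null)) return v;
        }
        return null; // every slot looked empty on this pass
    }

    public static void main(String[] args) {
        SlotQueue<String> q = new SlotQueue<>(4);
        q.offer("task");
        System.out.println(q.poll());
    }
}
```

Ordering is only approximate: a value can be taken before an older one that sits in a slot the consumer's scan has not reached.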

Example 3: Nearly FIFO Queue
Nearly FIFO:
• Consumers must advance scan point
• Or might neglect tasks left in other slots
• Means every value in array gets visited eventually
Tricky bit: correct array size for efficiency
• Too small, table gets full, producers spin uselessly
• Too large, table is mostly empty, consumers scan uselessly
Array copy & promote is easier:
• Risk: late insert in old array just prior to promote abandons value
• Consumers fill old array with 'tombstone'
• Promote when old array is entirely 'stoned'

Still need feedback mechanisms on P/C threadpools

Example 3: Nearly FIFO Queue
Work in progress, no code yet...
But out of time anyways ;-)
Nice idea, hope it pans out

Summary
Lock-Free
Highly scalable (proven scalable to ~1000 CPUs)
Use large array for data
• Allows fast parallel-read
• Allows parallel, incremental, concurrent copy
Use Finite State Machine to control writes
• FSM-per-word
• Successful CAS advances FSM
• Failed CAS retries
During copy, FSM includes words from both arrays

http://www.azulsystems.com/blogs/cliff

Dr. Cliff Click, Distinguished Engineer
Azul Systems
http://blogs.azulsystems.com/cliff

2008 JavaOneSM Conference | java.sun.com/javaone |

57

Towards a Scalable Non-Blocking Coding Style

  • 1.
    Towards a Scalable Non-BlockingCoding Style Dr. Cliff Click, Distinguished Engineer Azul Systems http://blogs.azulsystems.com/cliff
  • 2.
    The Computer Revolutionis Here We already did the 0->1 cpu transition Concurrent Programming is Now 'The Norm' and hard to do We're doing the 1->2 cpu transition Scalable Concurrent Programming is even harder Time to think about the 2->N cpu transition Here is a different way of thinking about the problem 2008 JavaOneSM Conference | java.sun.com/javaone | 2
  • 3.
    What is Non-BlockingAlgorithm? Formally: • Stopping one thread will not prevent global progress Less formally: • No thread 'locks' any resource • and then gets pre-empted by OS • Or blocked in I/O, etc • No 'critical sections', locks, mutexs, spin-locks, etc Individual threads might starve 2008 JavaOneSM Conference | java.sun.com/javaone | 3
  • 4.
    XXX-Free Hierarchy Wait-Free Algorithms(the best) • All threads complete in finite count of steps • Low priority threads cannot block high priority threads • No priority inversion possible Lock-Free (this work) • Every successful step makes Global Progress • But individual threads may starve • Hence priority inversion is possible • No live-lock Obstruction-Free • A single thread in isolation completes in finite count of steps • Threads may block each other • Hence live-lock is possible 2008 JavaOneSM Conference | java.sun.com/javaone | 4
  • 5.
    Motivation Multi-core is nowalmost unavoidable Larger core counts more common: • 8+ (X86), 64 (Sun/ Rock), 768 (Azul, more coming) Locking suffers serious contention issues • Amdahl's Law, etc Would like to write correct code without locks! Obstruction-free can live-lock • More prone with higher cpu count • Or higher thread count Wait-free algorithms behave the best • But tend to be slow • And are very hard to code • Handful of people on the planet can write these 2008 JavaOneSM Conference | java.sun.com/javaone | 5
  • 6.
    Scalable Most large-CPU countshared-memory hardware is: • Parallel-read, Independent-write Multiple CPUs reading the same location is fast • Free 'cache-hitting-loads' Multiple CPUs writing to the same location serialize • Speed limited to '1-cache-miss-per-write' or '1-memory-bus-update-per-write' Must avoid all CPUs writing same location for independent operations • i.e., no shared counters, single lock-words, etc Classic reader/writer lock chokes w/ >100 CPUs • Contention on single reader-count word limits scaling 2008 JavaOneSM Conference | java.sun.com/javaone | 6
  • 7.
    Agenda Motivation A Scalable Non-BlockingCoding Style Example 1: BitVector Example 2: HashTable Example 3: Nearly FIFO Queue Summary 2008 JavaOneSM Conference | java.sun.com/javaone | 7
  • 8.
    Parts we need... AnArray to hold all Data • Fast parallel (scalable) access Atomic-update on single Array Words • java.util.concurrent.Atomic.* • “No spurious failure CAS” A Finite State Machine • Replicated per array word (or small set of words) • Use Atomic-Update to 'step' in the FSM Construct algorithm from many FSM 'steps' • Lock-Free: Each CAS makes progress • CAS success is local progress • CAS failure means another CAS succeeded (global progress, local starvation) 2008 JavaOneSM Conference | java.sun.com/javaone | 8
  • 9.
    How Big isthe Array? Don't answer that: Make array growable • Resize array as needed • Common operation for Collection classes Support array resize via State Machine • Really: array-copy while in use • All array words are independent • Copy is parallel, incremental, concurrent But mostly operate without a copy-in-progress • So the common situation is simple, fast 2008 JavaOneSM Conference | java.sun.com/javaone | 9
  • 10.
    Concurrent Array Resize Copyold Array into a new larger Array The hard part during a resize operation: • Copy without losing any late-writes to old Array Fix: “mark” old Array words with no-more-updates flag • Payload still visible through the “mark” Updaters' of marked payload must copy then update in new array Readers' seeing mark must copy then read in new array 2008 JavaOneSM Conference | java.sun.com/javaone | 10
  • 11.
    Atomic Update Need someform of Atomic-Update • java.util.concurrent.atomic.Atomic* Update 1 word IFF old-value is equal to expected-value Generally Compare-And-Swap (CAS, Azul/Sparc/X86) or Load-Linked / Store-Conditional (LL/SC, IBM) Common Hardware Limitations • LL/SC suffers from live-lock • Both CAS & LL/SC can suffer spurious failure on some hardware • Infinite spurious failures is live-lock(?) • Finite failures fixed with spin loop • Useful if CAS does not spuriously fail (e.g. Azul) • Especially at high CPU count • If 1000 CPUs attempt update, 1 should succeed 2008 JavaOneSM Conference | java.sun.com/javaone | 11
  • 12.
    Atomic Update: Failure CASfailure returns old value on most (all?) hardware? • Old value is evidence CAS did not fail spuriously • The “witness” - the “proof of failure” • LL/SC never provides old value The witness not available after the CAS • Overwritten by another thread JDK API mistake: witness turned into a boolean • Hence failure-for-cause can not be distinguished from spurious-failure Hence must spin on CAS failure until see reason for failure • Report either CAS success OR • CAS failure-for-cause Spinning builds a “No spurious failure CAS” 2008 JavaOneSM Conference | java.sun.com/javaone | 12
  • 13.
    Towards A ScalableLock-Free Coding Style Big Array to hold Data Parallel, Scalable read access Concurrent writes via: CAS & Finite State Machine • No JMM issues during Finite State Machine updates • No locks, no volatile Fast as a best-of-breed not-thread-safe implementation • But as correct as thread-safe implementations • Much faster than locking under heavy load • No indirections in common case • Directly reach main data array in 1 step Resize as needed • Copy Array to a larger Array on demand • Use State Machine to help copy • “Mark” old Array words to avoid missing late updates 2008 JavaOneSM Conference | java.sun.com/javaone | 13
  • 14.
    Agenda Motivation A Scalable Non-BlockingCoding Style Example 1: BitVector Example 2: HashTable Example 3: Nearly FIFO Queue Summary 2008 JavaOneSM Conference | java.sun.com/javaone | 14
  • 15.
    Example 1: BitVector Size:O(max element) • Auto-resizing Supports concurrent insert, remove, test&set Obvious implementation: • Array of 'long' - 64-bit payload words • Bit mask & shift accessors How to 'mark' payload? • Steal 1 bit out of 64 • MOD 63 to select index words – this example only • (Actually: avoid slow MOD by moving every 64th bit to recursive bitvector) Code up in SourceForge, high-scale-lib 2008 JavaOneSM Conference | java.sun.com/javaone | 15
  • 16.
    Example 1: BitVector Basicget & test/set (using MOD) boolean get( int x ) { long[] A = _A; int idx = x/63; if( idx >= A.length) return false; boolean test_set( int x ) { long[] A = _A; // read once int idx = x/63; if( idx >= A.length ) return grow(x); while( true ) { // spin loop int old = A[idx]; int old = A[idx]; if( old < 0 ) if( old < 0 ) // marked? return copy(x).get(x); return copy(x).test_set(x); long mask = 1L <<(x%63); long mask = 1L <<(x%63); return (old & mask)!=0; if( (old & mask) != 0) } return true; if( CAS(A[idx],old,old|mask)) return false; } 2008 JavaOneSM Conference | java.sun.com/javaone | 16
  • 17.
    Example 1: BitVector ReadArray once – it may change out from under us! boolean get( int x ) { long[] A = _A; int idx = x/63; if( idx >= A.length) return false; boolean test_set( int x ) { long[] A = _A; // read once int idx = x/63; if( idx >= A.length ) return grow(x); while( true ) { // spin loop int old = A[idx]; int old = A[idx]; if( old < 0 ) if( old < 0 ) // marked? return copy(x).get(x); return copy(x).test_set(x); long mask = 1L <<(x%63); long mask = 1L <<(x%63); return (old & mask)!=0; if( (old & mask) != 0) } return true; if( CAS(A[idx],old,old|mask)) return false; } 2008 JavaOneSM Conference | java.sun.com/javaone | 17
  • 18.
    Example 1: BitVector Out-of-boundstriggers resize boolean get( int x ) { long[] A = _A; int idx = x/63; if( idx >= A.length) return false; boolean test_set( int x ) { long[] A = _A; // read once int idx = x/63; if( idx >= A.length ) return grow(x); while( true ) { // spin loop int old = A[idx]; int old = A[idx]; if( old < 0 ) if( old < 0 ) // marked? return copy(x).get(x); return copy(x).test_set(x); long mask = 1L <<(x%63); long mask = 1L <<(x%63); return (old & mask)!=0; if( (old & mask) != 0) } return true; if( CAS(A[idx],old,old|mask)) return false; } 2008 JavaOneSM Conference | java.sun.com/javaone | 18
  • 19.
    Example 1: BitVector 'Mark'triggers copy & retry boolean get( int x ) { long[] A = _A; int idx = x/63; if( idx >= A.length) return false; boolean test_set( int x ) { long[] A = _A; // read once int idx = x/63; if( idx >= A.length ) return grow(x); while( true ) { // spin loop int old = A[idx]; int old = A[idx]; if( old < 0 ) if( old < 0 ) // marked? return copy(x).get(x); return copy(x).test_set(x); long mask = 1L <<(x%63); long mask = 1L <<(x%63); return (old & mask)!=0; if( (old & mask) != 0) } return true; if( CAS(A[idx],old,old|mask)) return false; } 2008 JavaOneSM Conference | java.sun.com/javaone | 19
  • 20.
    Example 1: BitVector FailedCAS must retry – BUT! • Means another thread made progress boolean get( int x ) { long[] A = _A; int idx = x/63; if( idx >= A.length) return false; boolean test_set( int x ) { long[] A = _A; // read once int idx = x/63; if( idx >= A.length ) return grow(x); while( true ) { // spin loop int old = A[idx]; int old = A[idx]; if( old < 0 ) if( old < 0 ) // marked? return copy(x).get(x); return copy(x).test_set(x); long mask = 1L <<(x%63); long mask = 1L <<(x%63); return (old & mask)!=0; if( (old & mask) != 0) } return true; if( CAS(A[idx],old,old|mask)) return false; } 2008 JavaOneSM Conference | java.sun.com/javaone | 20
  • 21.
    Example 1: BitVector Almostas fast as plain BitVector • Normal load & mask for get/set • Range check • Extra '<0' test (triggers copy & retry) • Set uses CAS spin-loop Copy: Sign-bit to stop further updates • Use CAS to set sign-bit • Then copy word to new array • Repeat operation on new array Finite State Machine! • per Array word • Hidden in the code Let's make the FSM obvious... 2008 JavaOneSM Conference | java.sun.com/javaone | 21
  • 22.
    BitVector State Machine 0000 “initial” 2008JavaOneSM Conference | java.sun.com/javaone | 22
  • 23.
    BitVector State Machine set& clear 0000 set A: Normal operations 0XXX “active” 2008 JavaOneSM Conference | java.sun.com/javaone | 23
  • 24.
    BitVector State Machine set& clear 0000 set A: Normal operations 0XXX Out-of-Bounds set triggers resize! old array new array 0000 “initial” 2008 JavaOneSM Conference | java.sun.com/javaone | 24
  • 25.
    BitVector State Machine set& clear 0000 set A: Normal operations 0XXX mark B: Mark to prevent further updates 1XXX “marked” old array new array 0000 2008 JavaOneSM Conference | java.sun.com/javaone | 25
  • 26.
    BitVector State Machine set& clear 0000 set A: Normal operations 0XXX mark B: Mark to prevent further updates 1XXX old array new array 0000 copy 0XXX C: Copy from old to new 2008 JavaOneSM Conference | java.sun.com/javaone | 26
  • 27.
    BitVector State Machine
    [Diagram: old array: 0000 → (set) → 0XXX, “set & clear” loop on 0XXX; 0XXX → (mark) → 1XXX; new array: 0000 → (copy) → 0XXX]
    A: Normal operations
    B: Mark to prevent further updates
    C: Copy from old to new
    D: Memory-fence between arrays
  • 28.
    BitVector State Machine
    [Diagram: old array: 0000 → (set) → 0XXX, “set & clear” loop on 0XXX; 0XXX → (mark) → 1XXX → (copy done) → 1000 “copy-done”; new array: 0000 → (copy) → 0XXX]
    A: Normal operations
    B: Mark to prevent further updates
    C: Copy from old to new
    D: Memory-fence between arrays
    E: Signal copy done in old table
  • 29.
    BitVector State Machine
    [Diagram: old array: 0000 → (set) → 0XXX, “set & clear” loop on 0XXX; 0XXX → (mark) → 1XXX → (copy done) → 1000; 0000 → (mark) → 1000; new array: 0000 → (copy) → 0XXX]
    A: Normal operations
    B: Mark to prevent further updates
    C: Memory-fence between arrays
    D: Copy from old to new
    E: Signal copy done in old table
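The per-word state machine above (mark, copy, fence, copy-done) can be sketched for a single word as follows. This is a sketch under stated assumptions rather than the talk's implementation: `WordCopySketch` and `copyWord` are hypothetical names, the new array is assumed to be at least as long as the old one, and the bit vector is assumed to be set-only (no clear), so OR-ing the payload into the new word more than once is harmless. The fence needs no explicit code here because `AtomicLongArray.compareAndSet` has volatile read and write memory effects.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical helper: drives one word through 0XXX -> 1XXX -> 1000.
final class WordCopySketch {
    static final long MARK = Long.MIN_VALUE;   // sign bit: "no further updates"

    static void copyWord(AtomicLongArray oldA, AtomicLongArray newA, int idx) {
        // Mark: CAS the sign bit into the old word (0XXX -> 1XXX).
        long w;
        while (true) {
            w = oldA.get(idx);
            if (w < 0) break;                              // someone else marked it
            if (oldA.compareAndSet(idx, w, w | MARK)) { w |= MARK; break; }
        }
        long payload = w & ~MARK;
        if (payload == 0) return;                          // 1000: nothing left to copy

        // Copy: OR the frozen payload into the new word.
        // Safe to repeat only because this sketch assumes bits are never cleared.
        while (true) {
            long n = newA.get(idx);
            if ((n & payload) == payload) break;           // already copied
            if (newA.compareAndSet(idx, n, n | payload)) break;
        }

        // Copy-done: clear the payload in the old word (1XXX -> 1000).
        // The CAS above is ordered before this one, so a reader that sees 1000
        // in the old array also sees the payload in the new array.
        oldA.compareAndSet(idx, w, MARK);
    }
}
```

Any number of threads may call `copyWord` on the same index; each CAS either advances the state machine or fails because another thread already advanced it.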
  • 30.
    Resize - motivation
    Triggered by adding larger element
    Copy each word before get/put
    Pay indirection even after copy
    • Visit old table, fence, operate on new table
    So need to copy all words eventually, and then
    Promote: make new array the top-level array
    • No more indirection
    Policy? How to copy all words?
    • Visiting threads can “copy some words”
    • Or background threads copy, or only-writers, etc
    • Good standard engineering, nothing special
  • 31.
    Resize – Copy Mechanics
    Helper: any thread copying words it does not directly need
    Helpers CAS-up a “promise to copy” counter
    • Atomic-increment by fixed N (e.g. 16 words)
    Helpers copy words via State Machine
    Helpers atomic-increment “done work” counter
    • On transition to “copy-done” state
    Promote new Array when “done work” == A.length
    What If: Helper stalled? (promises but never copies)
    • Allow helpers to “double-promise”!
    • Worst case: each thread can complete entire copy
    Eventually, copy completes & array promotes
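One way to picture the two counters is the sketch below. It is an illustration under assumptions, not the talk's code: `ResizeHelperSketch`, `helpCopy`, and the `promoted` flag (standing in for swinging the top-level array reference) are hypothetical, the bit vector is assumed set-only, and "double-promise" is modeled by letting promises wrap around the array so that later helpers re-cover words a stalled helper promised but never copied.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLongArray;

final class ResizeHelperSketch {
    static final int CHUNK = 16;                  // words promised per helper call
    static final long MARK = Long.MIN_VALUE;

    final AtomicLongArray oldA, newA;
    final AtomicInteger promised = new AtomicInteger();  // "promise to copy" counter
    final AtomicInteger done = new AtomicInteger();      // "done work" counter
    volatile boolean promoted;                           // stand-in for the promote step

    ResizeHelperSketch(AtomicLongArray oldA, AtomicLongArray newA) {
        this.oldA = oldA;
        this.newA = newA;
    }

    // Any visiting thread may call this; it never waits on another thread.
    void helpCopy() {
        int len = oldA.length();
        int start = promised.getAndAdd(CHUNK);    // promise a chunk
        int finished = 0;
        for (int i = 0; i < CHUNK; i++) {
            // Promises past the end wrap around, re-covering earlier words.
            if (copyWord(Math.floorMod(start + i, len))) finished++;
        }
        // Exactly one addAndGet can land on len, so exactly one thread promotes.
        if (finished > 0 && done.addAndGet(finished) == len) promoted = true;
    }

    // Returns true only for the one thread that moved this word to copy-done.
    private boolean copyWord(int idx) {
        long w;
        while (true) {
            w = oldA.get(idx);
            if (w < 0) break;
            if (oldA.compareAndSet(idx, w, w | MARK)) {
                if (w == 0) return true;          // empty word: 0000 -> 1000 directly
                w |= MARK;
                break;
            }
        }
        long payload = w & ~MARK;
        if (payload == 0) return false;           // another thread already finished it
        while (true) {
            long n = newA.get(idx);
            if ((n & payload) == payload || newA.compareAndSet(idx, n, n | payload)) break;
        }
        return oldA.compareAndSet(idx, w, MARK);  // 1XXX -> 1000, counted once
    }
}
```

Because each word's copy-done transition is won by exactly one CAS, the "done work" counter reaches the array length exactly once, regardless of how many helpers overlap.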
  • 32.
    Coding Style Elements
    Large array for parallel read & update
    • No JMM issues for read or update (no lock, no volatile)
    State Machine per-array-word
    • Successful CAS is FSM transition
    • Failed CAS causes retry
    • (but another thread made progress)
    'Mark' payload words to stop 'late updates'
    Array copy for Resize
    • Copy is parallel, incremental, concurrent
    • Copy part of State Machine
    • Unrelated threads can make progress during resize
    • Fence between old and new tables
  • 33.
    Agenda
    Motivation
    A Scalable Non-Blocking Coding Style
    Example 1: BitVector
    Example 2: HashTable
    Example 3: Nearly FIFO Queue
    Summary
  • 34.
    Example 2: HashTable
    Array of K/V Pairs
    • Keys in even slots, Values in odd slots
    • CAS each word separately, but FSM spans both words
    • Value can also be 'Tombstone'
    • Key & Value both start as null
    Mark payload by 'boxing' values
    Copy on resize, or to flush stale keys
    Supports concurrent insert, remove, test, resize
    Linear scaling on Azul to 768 CPUs
    • More than a billion reads/sec simultaneous with
    • More than 10 million updates/sec
    Code up in SourceForge, high-scale-lib
    • Passes Java Compatibility Kit (JCK) for ConcurrentHashMap
  • 35.
    “Uninteresting” Details
    Good, standard engineering – nothing special
    Closed Power-of-2 Hash Table
    • Reprobe on collision
    • Stride-1 reprobe: better cache behavior
    • (complicated argument about 2^n vs prime goes here)
    Key & Value on same cache line
    Hash memoized
    • Should be same cache line as K + V
    • But hard to do in pure Java
    No allocation on get() or put()
    Auto-Resize
  • 36.
    HashTable State Machine
    [Diagram: 0/0 “initial”]
    • Inserting K/V pair
    • Already probed table, missed
    • Found proper empty K/V slot
    • Ready to claim slot for this Key
  • 37.
    HashTable State Machine
    [Diagram: 0/0 → (insert key) → K/0 “bare key”]
    Claim key slot
  • 38.
    HashTable State Machine
    [Diagram: 0/0 → (insert key) → K/0 → (insert V) → K/V “active”]
    Initial set of Value
  • 39.
    HashTable State Machine
    [Diagram: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (delete) → K/T “deleted”]
    Delete uses 'tombstone' value; Key remains
  • 40.
    HashTable State Machine
    [Diagram: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (change V) → K/V; K/V → (delete) → K/T “deleted”; K/T → (re-insert) → K/V]
    Change Value uses same key slot
    Re-insert uses same key slot
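The single-array part of this state machine (claim the key slot, then set, change, delete, or re-insert the value) can be sketched in Java. This is a simplified illustration and not the high-scale-lib code: `KVArraySketch` and its methods are hypothetical names, the table has a fixed power-of-2 capacity with stride-1 reprobing and never resizes, hashes are not memoized, and keys and values are assumed non-null.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical fixed-size table: key i in slot 2*i, its value in slot 2*i+1.
final class KVArraySketch {
    private static final Object TOMBSTONE = new Object();   // the "T" in K/T
    private final AtomicReferenceArray<Object> kvs;
    private final int mask;                                  // capacity - 1

    KVArraySketch(int log2Capacity) {
        int cap = 1 << log2Capacity;
        kvs = new AtomicReferenceArray<>(2 * cap);
        mask = cap - 1;
    }

    Object get(Object key) {
        int i = key.hashCode() & mask;
        for (int probes = 0; probes <= mask; probes++) {
            Object k = kvs.get(2 * i);
            if (k == null) return null;                      // 0/0: clean miss
            if (k.equals(key)) {
                Object v = kvs.get(2 * i + 1);
                return v == TOMBSTONE ? null : v;            // K/0 and K/T read as absent
            }
            i = (i + 1) & mask;                              // stride-1 reprobe
        }
        return null;
    }

    // Both return the previous live value, or null if there was none.
    Object put(Object key, Object val) { return update(key, val); }
    Object remove(Object key)          { return update(key, TOMBSTONE); }

    private Object update(Object key, Object newVal) {
        int i = key.hashCode() & mask;
        for (int probes = 0; probes <= mask; probes++) {
            Object k = kvs.get(2 * i);
            if (k == null) {
                if (newVal == TOMBSTONE) return null;        // removing an absent key
                // 0/0 -> K/0: claim the key slot; on failure, see who won it.
                k = kvs.compareAndSet(2 * i, null, key) ? key : kvs.get(2 * i);
            }
            if (k.equals(key)) {
                while (true) {                               // K/0, K/V, K/T -> K/newVal
                    Object old = kvs.get(2 * i + 1);
                    if (kvs.compareAndSet(2 * i + 1, old, newVal))
                        return old == TOMBSTONE ? null : old;
                }
            }
            i = (i + 1) & mask;
        }
        throw new IllegalStateException("table full; the slides resize here");
    }
}
```

Key slots are written once and never cleared, which is why a delete leaves K/T rather than returning the slot to 0/0.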
  • 41.
    HashTable State Machine
    [Diagram: old array: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (change V) → K/V; K/V → (delete) → K/T; K/T → (re-insert) → K/V; new array: 0/0 “initial”]
    Resize triggered, new array created
  • 42.
    HashTable State Machine
    [Diagram: old array: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (change V) → K/V; K/V → (delete) → K/T; K/T → (re-insert) → K/V; K/V → (box) → K/[V] “boxed V”; new array: 0/0]
    Boxing V prevents further changes
  • 43.
    HashTable State Machine
    [Diagram: old array: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (change V) → K/V; K/V → (delete) → K/T; K/T → (re-insert) → K/V; K/V → (box) → K/[V]; new array: 0/0 → (insert key) → K/0 “bare key”]
    Claim key slot in new table
  • 44.
    HashTable State Machine
    [Diagram: old array: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (change V) → K/V; K/V → (delete) → K/T; K/T → (re-insert) → K/V; K/V → (box) → K/[V]; new array: 0/0 → (insert key) → K/0 → (copy) → K/V “active”]
    Copy in new table without box
  • 45.
    HashTable State Machine
    [Diagram: old array: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (change V) → K/V; K/V → (delete) → K/T; K/T → (re-insert) → K/V; K/V → (box) → K/[V]; new array: 0/0 → (insert key) → K/0 → (copy) → K/V]
    Memory-fence between arrays
    Fence after writing to new array and before setting 'copy done'
  • 46.
    HashTable State Machine
    [Diagram: old array: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (change V) → K/V; K/V → (delete) → K/T; K/T → (re-insert) → K/V; K/V → (box) → K/[V] → (copy done) → K/[T] “copy done”; new array: 0/0 → (insert key) → K/0 → (copy) → K/V]
    Memory-fence between arrays
    Final state: “new Array has Value”
  • 47.
    HashTable State Machine
    [Diagram: old array: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (change V) → K/V; K/V → (delete) → K/T; K/T → (re-insert) → K/V; K/V → (box) → K/[V] → (copy done) → K/[T]; new array: 0/0]
    Memory-fence between arrays
    Nothing to copy
  • 48.
    HashTable State Machine
    [Diagram: old array: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (change V) → K/V; K/V → (delete) → K/T; K/T → (re-insert) → K/V; K/V → (box) → K/[V] → (copy done) → K/[T]; new array: 0/0]
    Memory-fence between arrays
    Copy stops partial insertion
  • 49.
    HashTable State Machine
    [Diagram: old array: 0/0 → (insert key) → K/0 → (insert V) → K/V; K/V → (change V) → K/V; K/V → (delete) → K/T; K/T → (re-insert) → K/V; K/V → (box) → K/[V] → (copy done) → K/[T]; new array: 0/0 → (insert key) → K/0 → (copy) → K/V]
    Memory-fence between arrays
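The copy half of the diagram (box the old value, claim the key in the new table, copy without the box, then signal copy-done) might look roughly like the following for one slot. This is a sketch under assumptions, not the library's code: `BoxedCopySketch`, `Box`, `DONE`, and `copySlot` are hypothetical names, the new table is assumed to have a power-of-2 capacity with free space, and treating an empty or bare-key slot by boxing it straight to K/[T] is this sketch's reading of the "nothing to copy" and "copy stops partial insertion" slides.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

final class BoxedCopySketch {
    static final Object TOMBSTONE = new Object();
    // A Box around a value means "frozen: no further changes in the old array".
    static final class Box { final Object v; Box(Object v) { this.v = v; } }
    static final Box DONE = new Box(TOMBSTONE);   // K/[T]: nothing (left) to copy

    // Copies old pair i (key at 2*i, value at 2*i+1) into newKvs.
    static void copySlot(AtomicReferenceArray<Object> oldKvs, int i,
                         AtomicReferenceArray<Object> newKvs) {
        // Box: freeze the old value. Empty, bare-key, and deleted slots go to K/[T].
        Object v;
        while (true) {
            v = oldKvs.get(2 * i + 1);
            if (v instanceof Box) break;                      // already frozen
            Object boxed = (v == null || v == TOMBSTONE) ? DONE : new Box(v);
            if (oldKvs.compareAndSet(2 * i + 1, v, boxed)) { v = boxed; break; }
        }
        if (v == DONE) return;                                // nothing to copy, or done

        // A live value implies the key was written first, so it is non-null here.
        Object key = oldKvs.get(2 * i);
        Object live = ((Box) v).v;

        // Claim the key slot in the new table (stride-1 reprobe, assumes free space).
        int mask = newKvs.length() / 2 - 1;
        int j = key.hashCode() & mask;
        while (true) {
            Object k = newKvs.get(2 * j);
            if (k == null && newKvs.compareAndSet(2 * j, null, key)) break;
            if (key.equals(newKvs.get(2 * j))) break;
            j = (j + 1) & mask;
        }

        // Copy without the box, but only if the new slot has no value yet:
        // a newer write in the new table must win over the stale copy.
        newKvs.compareAndSet(2 * j + 1, null, live);

        // Copy done: K/[V] -> K/[T]. Ordered after the copy by the CAS above.
        oldKvs.compareAndSet(2 * i + 1, v, DONE);
    }
}
```

The copy into the new table deliberately CASes from null only, so a helper that arrives late cannot overwrite a value that was already changed or deleted in the new array.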
  • 50.
    Agenda
    Motivation
    A Scalable Non-Blocking Coding Style
    Example 1: BitVector
    Example 2: HashTable
    Example 3: Nearly FIFO Queue
    Summary
  • 51.
    Example 3: Nearly FIFO Queue
    Concurrent near-FIFO Queue
    • e.g. producer / consumer worklist
    • Producers & consumers are large thread pools
    Scaling bottleneck:
    • Locking or single word CAS on push & pop
    Could stripe Queue:
    • Many short Queues
    • Select random Queue
    • Many different locks or many different words to CAS
    • Less contention
    • Pick at random to push or pop
    • Must search all queues for not-full or not-empty
  • 52.
    Example 3: Nearly FIFO Queue
    1000's of CPUs need 1000's of Queues
    • Stripe Ad-Absurdum
    • Queues get ever-smaller
    • Get down to Queues of 1 entry
    Single-entry Queue: either full or empty
    • Implement as a single word
    • Either null or value
    Need 1000's of single-entry Queues
    • Array of single word Queues
    Producers start @ random index
    • Search for null, CAS down value
    Consumers start @ random index
    • Search for value, CAS down null
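Since the talk states that no code exists yet for this example, the following is only a speculative sketch of the "array of single-word queues" idea. `SlotQueueSketch`, `offer`, and `poll` are hypothetical names, the array size is fixed, values must be non-null, and each call picks a fresh random start index rather than keeping a per-thread scan point.

```java
import java.util.Objects;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical: each array word is a one-entry queue, null = empty.
final class SlotQueueSketch<T> {
    private final AtomicReferenceArray<T> slots;

    SlotQueueSketch(int n) { slots = new AtomicReferenceArray<>(n); }

    // Producer: search for null, CAS down the value.
    // Returns false if one full scan found no empty slot.
    boolean offer(T value) {
        Objects.requireNonNull(value);
        int n = slots.length();
        int start = ThreadLocalRandom.current().nextInt(n);
        for (int i = 0; i < n; i++) {
            int idx = (start + i) % n;
            if (slots.get(idx) == null && slots.compareAndSet(idx, null, value))
                return true;
        }
        return false;
    }

    // Consumer: search for a value, CAS down null.
    // Returns null if one full scan found no value.
    T poll() {
        int n = slots.length();
        int start = ThreadLocalRandom.current().nextInt(n);
        for (int i = 0; i < n; i++) {
            int idx = (start + i) % n;
            T v = slots.get(idx);
            if (v != null && slots.compareAndSet(idx, v, null))
                return v;
        }
        return null;
    }
}
```

A failed CAS in either loop means another producer or consumer got to that slot first, so the scan simply moves on to the next word instead of retrying the same one.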
  • 53.
    Example 3: Nearly FIFO Queue
    Nearly FIFO:
    • Consumers must advance scan point
    • Or might neglect tasks left in other slots
    • Means every value in array gets visited eventually
    Tricky bit: correct array size for efficiency
    • Too small, table gets full, producers spin uselessly
    • Too large, table is mostly empty, consumers scan uselessly
    Array copy & promote is easier:
    • Risk: late insert in old array just prior to promote abandons value
    • Consumers fill old array with 'tombstone'
    • Promote when old array is entirely 'stoned
    Still need feedback mechanisms on P/C threadpools
  • 54.
    Example 3: Nearly FIFO Queue
    Work in progress, no code yet...
    But out of time anyways ;-)
    Nice idea, hope it pans out
  • 55.
    Agenda
    Motivation
    A Scalable Non-Blocking Coding Style
    Example 1: BitVector
    Example 2: HashTable
    Example 3: Nearly FIFO Queue
    Summary
  • 56.
    Summary
    Lock-Free
    Highly scalable (proven scalable to ~1000 CPUs)
    Use large array for data
    • Allows fast parallel-read
    • Allows parallel, incremental, concurrent copy
    Use Finite State Machine to control writes
    • FSM-per-word
    • Successful CAS advances FSM
    • Failed CAS retries
    During copy, FSM includes words from both arrays
    http://www.azulsystems.com/blogs/cliff
  • 57.
    Dr. Cliff Click, Distinguished Engineer
    Azul Systems
    http://blogs.azulsystems.com/cliff