Understanding of
Linux Kernel Memory Model
SeongJae Park <sj38.park@gmail.com>
Great To Meet You
● SeongJae Park <sj38.park@gmail.com>
● Contributing to the Linux kernel, just for fun, since 2012
● Developing Guaranteed Contiguous Memory Allocator
○ Source code is available: https://lwn.net/Articles/634486/
● Maintaining the Korean translation of the Linux kernel memory barrier document
○ The translation was merged into mainline in v4.9-rc1
○ https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/ko_KR/memory-barriers.txt?h=v4.9-rc1
https://www.linux.com/sites/lcom/files/styles/rendered_file/public/kernel-dev-2015.jpg?itok=9s3mV2Nc
Programmers in Multi-core Land
● About a decade ago, processor vendors changed course: instead of raising
clock speed, they now increase the number of cores
○ Multi-core systems are now prevalent
○ Maybe there is even an octa-core portable bomb in your pocket?
● As a result, the free lunch is over;
parallel programming is essential for high performance and scalability
http://www.gotw.ca/images/CPU.png
GIVE UP :’(
Writing a Correct Parallel Program is Hard
● Compilers and processors are optimized for Instructions Per Cycle, not for
programmer-perspective goals such as response time or throughput of
progress that is meaningful to people
● The nature of parallelism is counter-intuitive
○ Time is relative; before and after are ambiguous; even "simultaneous" is possible
● The C language was developed under a uni-processor assumption
■ “Et tu, C?”
CPU 0 CPU 1
A = 1;
B = 1;
A = 2;
B = 2;
assert(B == 2 && A == 1)
CPU 1 assertion can be true on most parallel programming environments
TL; DR
● Memory operations can be reordered, merged, or discarded in any way,
as long as the behavior defined by the memory model is not violated
○ In short, “We’re all mad here” in parallel land
● Knowing the memory model is important for writing correct, fast, and scalable parallel
programs
https://ih1.redbubble.net/image.218483193.6460/sticker,220x200-pad,220x200,ffffff.u2.jpg
Reordering for Better IPC (Instructions Per Cycle)
Simple Program Execution Sequence
● The programmer writes a program in a C-like, human-readable language
● The compiler translates the human-readable code into assembly language
● The assembler generates an executable binary from the assembly code
● The processor executes the instruction sequence in the binary
○ The execution result is guaranteed to be the same as that of sequential execution;
in other words, the execution itself is not guaranteed to be sequential
C source code:
    #include <stdio.h>
    int main(void)
    {
        printf("hello world\n");
        return 0;
    }
Assembly code (compiler output, first lines):
    main:
    .LFB0:
        .cfi_startproc
        pushq %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq %rsp, %rbp
Executable binary (assembler output, first bytes):
    00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
    00000010: 0200 3e00 0100 0000 3004 4000 0000 0000  ..>.....0.@.....
    00000020: 4000 0000 0000 0000 d819 0000 0000 0000  @...............
    00000030: 0000 0000 4000 3800 0900 4000
Instruction Level Parallelism (ILP)
● Pipelining introduces instruction level parallelism
○ Each instruction is split up into a sequence of steps;
because different steps can run in parallel, several instructions can be processed concurrently
    Cycle            1       2       3       4       5
    Instruction 1    fetch   decode  execute
    Instruction 2            fetch   decode  execute
    Instruction 3                    fetch   decode  execute
Without pipelining: 3 cycles per instruction
A 3-stage pipeline can retire 3 instructions in 5 cycles: about 1.7 cycles per instruction
Dependent Instructions Harm ILP
● If an instruction depends on the result of a previous instruction,
it has to wait until the previous one finishes execution
○ E.g., a = b + c;
d = a + b;
    Cycle            1       2       3       4       5       6       7
    Instruction 1    fetch   decode  execute
    Instruction 2                            fetch   decode  execute
    Instruction 3                                    fetch   decode  execute
In this case, instruction 2 depends on the result of instruction 1
(e.g., the first instruction modifies the opcode of the next instruction)
7 cycles for 3 instructions: about 2.3 cycles per instruction
Instruction Reordering Helps Performance
● By moving dependent instructions far away from each other, total execution
time can be shortened
● If the reordering is guaranteed not to change the result of the instruction
sequence, it helps performance
    Cycle            1       2       3       4       5       6
    Instruction 1    fetch   decode  execute
    Instruction 3            fetch   decode  execute
    Instruction 2                            fetch   decode  execute
Instruction 2 depends on the result of instruction 1
(e.g., the first instruction modifies the opcode of the next instruction)
By swapping instructions 2 and 3, total execution time is shortened
6 cycles for 3 instructions: 2 cycles per instruction
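The cycle counts on the pipeline slides (5, 7, and 6 cycles) can be reproduced with a small toy model. This is only a sketch under simplifying assumptions that are not in the slides: one instruction enters the pipeline per cycle, every stage takes exactly one cycle, and a dependent instruction cannot be fetched until the instruction it depends on has finished its execute stage. The function name `total_cycles` and the `dep` array encoding are hypothetical.

```c
#define STAGES 3        /* fetch, decode, execute: one cycle each */
#define MAX_INSTS 16    /* arbitrary limit of this toy model */

/*
 * Hypothetical toy model, not a real CPU.  Instructions enter the pipeline
 * in issue order, at most one per cycle.  dep[k] is the issue-order index of
 * the earlier instruction that instruction k depends on, or -1 for none.
 * Returns the cycle in which the last instruction finishes.
 */
int total_cycles(const int *dep, int nr_insts)
{
    int finish[MAX_INSTS];  /* cycle in which each instruction finishes */
    int start = 0;          /* fetch cycle of the previous instruction */
    int last = 0;
    int k;

    for (k = 0; k < nr_insts && k < MAX_INSTS; k++) {
        start = start + 1;
        /* stall the fetch until the producer has executed */
        if (dep[k] >= 0 && finish[dep[k]] + 1 > start)
            start = finish[dep[k]] + 1;
        finish[k] = start + STAGES - 1;
        if (finish[k] > last)
            last = finish[k];
    }
    return last;
}
```

With issue order 1, 2, 3 and instruction 2 depending on instruction 1, the array is `{ -1, 0, -1 }`; with the reordered issue order 1, 3, 2 it is `{ -1, -1, 0 }`, because the dependent instruction now sits in the last slot.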
Reordering is Legal, Encouraged Behavior, But...
● As long as program causality is preserved, any
reordering is legal
● Processors and compilers can reorder
instructions for better IPC
● However, program causality is defined for a single-processor
environment
● IPC-focused reordering is not aware of
programmer-perspective performance goals
such as throughput or latency
● On a multi-processor system, reordering can
harm not only correctness but also
performance
https://s-media-cache-ak0.pinimg.com/originals/c2/1a/00/c21a007e0542f7da57dacd15a86d478d.jpg
Throughput or latency?
Instructions Per Cycle
Counter-intuitive Nature of
Parallelism
Time is Relative (E = mc²)
● Each CPU generates its events in its own time and observes the effects of events in
relative time
● It is impossible to define an absolute order of two concurrent events;
only a relative observation order is possible
    CPU 1: generated event 1
    CPU 2: generated event 2
    CPU 3: observed event 1 followed by event 2
    CPU 4: observed event 2 followed by event 1
    (all four CPUs are connected by an event bus)
Relative Event Propagation of Hierarchical Memory
● Most systems are equipped with hierarchical memory for better performance and space
● The propagation speed of an event to a given core can be influenced by a specific
sub-layer of the memory hierarchy
● If the CPU 0 message queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event from CPU 1 (event B),
even though CPU 1 observed event A before generating event B
    System: CPU 0 and CPU 1 share one cache, CPU 2 and CPU 3 share another;
    each CPU has its own message queue; caches and memory are connected by a bus
    CPU 0: generates event A
    CPU 1: sees event A, then generates event B
    CPU 2: sees event B, then sees event A
Cache Coherency is Eventual
● It is well known that the cache coherency protocol helps keep system memory
consistent
● In fact, cache coherency guarantees only eventual consistency
● Every effect of each CPU will eventually become visible on all CPUs, but
there is no guarantee that the effects will become apparent in the same order on
those other CPUs
http://img06.deviantart.net/0663/i/2005/112/b/6/schrodinger__s_cat___2_by_firefoxcentral.jpg
System with Store Buffer and Invalidation Queue
● The store buffer and the invalidation queue deliver the effects of events, but do not
guarantee the order in which each CPU observes them
    CPU 0: cache, store buffer, invalidation queue
    CPU 1: cache, store buffer, invalidation queue
    Both CPUs are connected to memory through the bus
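How a store buffer hides a store from other CPUs can be sketched with a toy simulation. This is a minimal, hypothetical model (the names `toy_cpu`, `toy_store`, `toy_load`, `toy_drain`, and `both_loads_see_zero` are made up here): each CPU holds at most one pending store, there is no cache and no invalidation queue, and draining the buffer stands in for the effect of a full memory barrier.

```c
#include <string.h>

enum { X, Y, NR_VARS };

/* Hypothetical toy CPU: at most one pending store, no cache, no timing */
struct toy_cpu {
    int buffered;   /* is a store waiting in the store buffer? */
    int var;        /* variable the pending store targets */
    int val;        /* value of the pending store */
};

static int memory[NR_VARS];

/* A store only goes into the local store buffer; others cannot see it yet */
static void toy_store(struct toy_cpu *cpu, int var, int val)
{
    cpu->buffered = 1;
    cpu->var = var;
    cpu->val = val;
}

/* A load is served from the local store buffer if possible, else from memory */
static int toy_load(const struct toy_cpu *cpu, int var)
{
    if (cpu->buffered && cpu->var == var)
        return cpu->val;
    return memory[var];
}

/* Draining the store buffer finally makes the store visible to everyone */
static void toy_drain(struct toy_cpu *cpu)
{
    if (cpu->buffered)
        memory[cpu->var] = cpu->val;
    cpu->buffered = 0;
}

/*
 * CPU 0 runs "X = 1; r1 = Y;" and CPU 1 runs "Y = 1; r2 = X;".
 * Returns 1 if both loads saw 0, and 0 otherwise.
 */
int both_loads_see_zero(int drain_before_loads)
{
    struct toy_cpu cpu0 = { 0 }, cpu1 = { 0 };
    int r1, r2;

    memset(memory, 0, sizeof(memory));
    toy_store(&cpu0, X, 1);
    toy_store(&cpu1, Y, 1);
    if (drain_before_loads) {   /* stands in for a full barrier */
        toy_drain(&cpu0);
        toy_drain(&cpu1);
    }
    r1 = toy_load(&cpu0, Y);
    r2 = toy_load(&cpu1, X);
    toy_drain(&cpu0);
    toy_drain(&cpu1);
    return r1 == 0 && r2 == 0;
}
```

Calling `both_loads_see_zero(0)` models the case without barriers, where each CPU reads the other's variable while the other's store is still buffered. Calling it with a non-zero argument drains both buffers first, so each load sees the other CPU's store.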
C-language and
Multi-Processor
The C Language Doesn’t Know Multi-Processor
● When the C language was first developed, multi-processor systems were rare
● As a result, the C language gives only a few guarantees about memory operations
on multi-processor systems
● For undefined cases, undefined behavior is allowed
https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/The_C_Programming_Language,_First_Edition_Cover_(2).svg/2000px-The_C_Programming_Language,_First_Edition_Cover_(2).svg.png
Compiler Optimizes Code
● Clever compilers try hard (really hard) to optimize code for high IPC
(again, not for programmer-perspective goals)
○ Convert small, private functions to inline code
○ Reorder memory access code to minimize dependencies
○ Simplify unnecessarily complex loops, ...
● Optimizations exploit `undefined behavior` however they want
○ It’s legal, but the result sometimes looks insane from the programmer’s perspective
● Compiler memory access reordering that follows the C standard, which is not
aware of multi-processor systems, can generate an unintended program
● The Linux kernel uses compiler directives and the volatile keyword to enforce memory
ordering
● C11 improves on this a lot, though
Memory Models
Each Environment Provides Its Own Memory Model
● A memory model defines how memory operations are generated, what effects they
have, and how those effects are propagated
● Each programming environment, such as an Instruction Set Architecture, a
programming language, or an operating system, defines its own memory model
○ Most modern language memory models (e.g., Golang, Rust, …) are aware of multi-processor systems
http://www.sciencemag.org/sites/default/files/styles/article_main_large/public/Memory.jpg?itok=4FmHo7M5
Each ISA Provides Specific Memory Model
● Some architectures have stricter ordering rules than others
● PA-RISC CPUs are the strictest; Alpha is the weakest
● Because the Linux kernel supports multiple architectures, it defines its memory
model based on the weakest one, Alpha
https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2015.01.31a.pdf
Synchronization Primitives
● Although reordering and asynchronous effect propagation are legal,
synchronization primitives are necessary for writing human-intuitive programs
● Most memory models provide synchronization primitives such as atomic
instructions and memory barriers
https://s-media-cache-ak0.pinimg.com/236x/42/bc/55/42bc55a6d7e5affe2d0dbe9c872a3df9.jpg
Atomic Operations
● An atomic operation is composed of multiple sub-operations
○ E.g., compare-and-swap, fetch-and-add, test-and-set
● Atomic operations are mutually exclusive
○ The intermediate state of an atomic operation cannot be seen by others
○ It can be thought of as a small critical section protected by a global lock
● Almost all hardware supports basic atomic operations
● In general, atomic operations are expensive
○ Misuse of atomic operations can severely degrade performance and scalability
http://www.scienceclarified.com/photos/atomic-mass-3024.jpg
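The three operations named on this slide can be tried in user space with C11 `<stdatomic.h>`. The sketch below is only an illustration of fetch-and-add, compare-and-swap, and test-and-set; it uses the C11 interface rather than the Linux kernel's atomic API, and the wrapper names (`next_ticket`, `set_if_equal`, `toy_try_lock`, `toy_unlock`) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int counter;                          /* starts at 0 */
static atomic_flag lock_flag = ATOMIC_FLAG_INIT;    /* starts clear */

/* fetch-and-add: add one and return the value seen before the addition */
int next_ticket(void)
{
    return atomic_fetch_add(&counter, 1);
}

/* compare-and-swap: store new_val only if counter still holds expected */
bool set_if_equal(int expected, int new_val)
{
    return atomic_compare_exchange_strong(&counter, &expected, new_val);
}

/* test-and-set: returns true only for the caller that found the flag clear */
bool toy_try_lock(void)
{
    return !atomic_flag_test_and_set(&lock_flag);
}

void toy_unlock(void)
{
    atomic_flag_clear(&lock_flag);
}
```

Each call reads, modifies, and writes its variable as one indivisible step, so no other thread can observe the state in between.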
Memory Barriers
● To allow synchronization of memory operations, a memory model provides
enforcement primitives, namely memory barriers
● In general, a memory barrier guarantees that
the effects of memory operations issued before it
are propagated to other components (e.g., processors) in the system
before the effects of memory operations issued after the barrier
● In general, a memory barrier is an expensive operation
    CPU 1 executes: READ A; WRITE B; <BARRIER>; READ C;
    CPU 2 observes: READ A, WRITE B, then READ C
    CPU 3 observes: WRITE B, READ A, then READ C
READ A and WRITE B can be reordered, but READ C is guaranteed to
be ordered after {READ A, WRITE B}
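A user-space sketch of the same idea, using C11 fences instead of kernel barriers: a producer writes a payload, issues a barrier, and then sets a flag; a consumer waits for the flag, issues a barrier, and then reads the payload. The use of POSIX threads, the variable names, and `run_message_passing` are all assumptions made for this example.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int data;            /* plain payload */
static atomic_int ready;    /* flag that publishes the payload */

static void *producer(void *arg)
{
    (void)arg;
    data = 42;                                      /* WRITE data */
    atomic_thread_fence(memory_order_release);      /* <BARRIER> */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg)
{
    int *seen = arg;

    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;                                           /* wait for the flag */
    atomic_thread_fence(memory_order_acquire);      /* <BARRIER> */
    *seen = data;                                   /* READ data */
    return NULL;
}

/* Runs one producer and one consumer; returns the payload the consumer saw */
int run_message_passing(void)
{
    pthread_t prod, cons;
    int seen = 0;

    data = 0;
    atomic_store(&ready, 0);
    pthread_create(&cons, NULL, consumer, &seen);
    pthread_create(&prod, NULL, producer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return seen;
}
```

Because of the two fences, a consumer that sees the flag set is guaranteed to read 42. Without them, the C11 memory model would allow the payload write and the flag write to be observed out of order.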
Linux Kernel Memory Model
● Defined by the weakest architecture, Alpha
○ Almost every combination of reordering is possible
● Provides a rich set of atomic instructions
○ atomic_xchg(), atomic_inc_return(), atomic_dec_return(), ...
● Provides CPU-level barriers, compiler-level barriers, and semantic-level barriers
○ Compiler barriers: WRITE_ONCE(), READ_ONCE(), barrier(), ...
○ CPU barriers: mb(), wmb(), rmb(), smp_mb(), smp_wmb(), smp_rmb(), …
○ Semantic barriers: ACQUIRE operations, RELEASE operations, …
○ For details, refer to https://www.kernel.org/doc/Documentation/memory-barriers.txt
● Because different barriers have different overheads, only the necessary barriers
should be used, and only where necessary, for high performance and scalability
Case Studies
Memory Operation Reordering
● Memory operation reordering is totally LEGAL unless it breaks causality
● Both the CPU and the compiler can do it, even on a single processor
    CPU 0         CPU 1                 CPU 2
    A = 1;        while (B == 0) {}     z = C;
    B = 1;        C = 1;                x = A;
                                        assert(z == 0 || x == 1)
● If every operation takes effect in program order, the assertion holds :)
● However, the stores of CPU 0 can be reordered:
    B = 1;
    A = 1;
● And the loads of CPU 2 can be reordered:
    x = A;
    z = C;
● With either reordering, CPU 2 can end up with z == 1 and x == 0,
so the assertion can fail
Memory Operation Reordering
● Memory operation reordering is totally LEGAL unless it breaks causality
● Both the CPU and the compiler can do it, even on a single processor
● A memory barrier makes the operations specified before it appear to have happened before
the operations specified after it
● On some architectures, even transitivity is not guaranteed
○ Transitivity: B happened after A; C happened after B; then C happened after A
    CPU 0         CPU 1                 CPU 2
    A = 1;        while (B == 0) {}     z = C;
    wmb();        mb();                 rmb();
    B = 1;        C = 1;                x = A;
                                        assert(z == 0 || x == 1)
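The three-CPU example can be approximated in user space with C11 atomics, where release stores and acquire loads play roles similar to the barriers on the slide. This is a sketch under assumptions, not the kernel code: it uses POSIX threads in place of CPUs, C11 memory orders in place of `wmb()`, `mb()`, and `rmb()`, and the function name `count_violations` is hypothetical. In the C11 model the release/acquire chain is transitive, so the check is expected to hold in every round.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int A, B, C;

/* Stands in for CPU 0: A = 1; wmb(); B = 1; */
static void *cpu0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    atomic_store_explicit(&B, 1, memory_order_release);
    return NULL;
}

/* Stands in for CPU 1: while (B == 0) {} mb(); C = 1; */
static void *cpu1(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&B, memory_order_acquire) == 0)
        ;
    atomic_store_explicit(&C, 1, memory_order_release);
    return NULL;
}

/* Stands in for CPU 2: z = C; rmb(); x = A; then check (z == 0 || x == 1) */
static void *cpu2(void *arg)
{
    int *ok = arg;
    int z = atomic_load_explicit(&C, memory_order_acquire);
    int x = atomic_load_explicit(&A, memory_order_relaxed);

    *ok = (z == 0 || x == 1);
    return NULL;
}

/* Runs the three threads 'rounds' times; returns how often the check failed */
int count_violations(int rounds)
{
    int failed = 0;

    while (rounds-- > 0) {
        pthread_t t[3];
        int ok = 0;

        atomic_store(&A, 0);
        atomic_store(&B, 0);
        atomic_store(&C, 0);
        pthread_create(&t[2], NULL, cpu2, &ok);
        pthread_create(&t[1], NULL, cpu1, NULL);
        pthread_create(&t[0], NULL, cpu0, NULL);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        pthread_join(t[2], NULL);
        failed += !ok;
    }
    return failed;
}
```

If the release and acquire orders were all weakened to `memory_order_relaxed`, the C11 model would no longer forbid z == 1 with x == 0, which corresponds to the reordered slides.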
Transitivity between Scheduler and Workers
● The scheduler and each worker reached a consensus about order
    Scheduler: “What time is it now?”
    Worker A, Worker B, ..., Worker Z: “Night!”
    Scheduler: “Worker Z, all workers agreed that it’s night. Do bedmaking!”
    Worker Z: “Yay!”
● But worker B and worker Z did not reach a consensus with each other
    Worker B: “Worker Z, I’m in afternoon! I didn’t tell you it’s night!”
    Worker Z: “!!??”
Compiler Memory Barrier
Code Example
Compiler Reordering Avoidance
● The compiler can remove the loop entirely
C code:
    static int the_var;
    void loop(void)
    {
        int i;
        for (i = 0; i < 1000; i++)
            the_var++;
    }
Assembly language code:
    loop:
    .LFB106:
        .cfi_startproc
        addl $1000, the_var(%rip)
        ret
        .cfi_endproc
    .LFE106:
Compiler Reordering Avoidance
● ACCESS_ONCE() is a compiler memory barrier implementation of the Linux kernel
● The loop is kept, but intermediate stores to the_var still cannot be seen by others:
the store to `the_var` is issued once after the loop, not on every iteration
C code:
    static int the_var;
    void loop(void)
    {
        int i;
        for (i = 0; ACCESS_ONCE(i) < 1000; i++)
            the_var++;
    }
Assembly language code:
    loop:
        ...
        movl the_var(%rip), %ecx
    .L175:
        ...
        addl $1, %eax
        ...
        cmpl $999, %edx
        jle .L175
        movl %esi, the_var(%rip)
    .L170:
        rep ret
Compiler Reordering Avoidance
● volatile forces the compiler to issue every memory operation the programmer wrote
(note that this does not force an access to DRAM)
● However, the repeated LOADs may harm performance
C code:
    static volatile int the_var;
    void loop(void)
    {
        int i;
        for (i = 0; ACCESS_ONCE(i) < 1000; i++)
            the_var++;
    }
Assembly language code:
    loop:
        ...
    .L174:
        movl the_var(%rip), %edx
        ...
        addl $1, %edx
        movl %edx, the_var(%rip)
        ...
        cmpl $999, %edx
        jle .L174
    .L170:
        rep ret
        .cfi_endproc
Compiler Reordering Avoidance
● A full compiler memory barrier, barrier(), helps in this case
● Each iteration accesses the_var in memory once, and the loop condition is checked in a register
C code:
    static int the_var;
    void loop(void)
    {
        int i;
        for (i = 0; i < 1000; i++) {
            the_var++;
            barrier();
        }
    }
Assembly language code:
    loop:
    .LFB106:
        ...
    .L172:
        addl $1, the_var(%rip)
        subl $1, %eax
        jne .L172
        rep ret
        .cfi_endproc
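To experiment with these slides outside the kernel, the barrier can be written by hand. The sketch below assumes GCC or Clang, because it relies on GNU C extensions. `barrier()` is written as an empty asm statement with a "memory" clobber, which is how the kernel defines it for GCC; `READ_ONCE_SKETCH()` and `WRITE_ONCE_SKETCH()` are simplified, hypothetical stand-ins for the kernel's READ_ONCE() and WRITE_ONCE(), not their real definitions.

```c
/* Assumption: GCC or Clang; both macros rely on GNU C extensions */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Simplified stand-ins: force one real access through a volatile cast */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int the_var;

/* The barrier keeps the compiler from collapsing the loop into one add */
int count_with_barrier(void)
{
    int i;

    the_var = 0;
    for (i = 0; i < 1000; i++) {
        the_var++;
        barrier();
    }
    return the_var;
}

/* Each iteration performs exactly one load and one store of the_var */
int count_with_once(void)
{
    int i;

    the_var = 0;
    for (i = 0; i < 1000; i++)
        WRITE_ONCE_SKETCH(the_var, READ_ONCE_SKETCH(the_var) + 1);
    return the_var;
}
```

Both functions return 1000; the difference is only visible in the generated assembly (for example with `gcc -O2 -S`), where the per-iteration memory accesses remain in the loop.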
CPU Memory Barrier
Code Example
Progress Perception
● The code does issue the LOAD and the STORE, but…
● see_progress() may see no progress, because a change made by one processor
propagates to other processors eventually, not immediately
C code:
    static int prgrs;
    void do_progress(void)
    {
        prgrs++;
    }
    void see_progress(void)
    {
        static int last_prgrs;
        static int seen;
        static int nr_seen;
        seen = prgrs;
        if (seen > last_prgrs)
            nr_seen++;
        last_prgrs = seen;
    }
Assembly language code:
    do_progress:
        ...
        addl $1, prgrs(%rip)
        ret
        ...
    see_progress:
        ...
        movl prgrs(%rip), %eax
        ...
        jle .L193
        addl $1, nr_seen.5542(%rip)
    .L193:
        movl %eax, last_prgrs.5540(%rip)
        ret
        .cfi_endproc
Progress Perception
● A read barrier and a write barrier help the situation
C code:
    static int prgrs;
    void do_progress(void)
    {
        prgrs++;
        smp_wmb();
    }
    void see_progress(void)
    {
        static int last_prgrs;
        static int seen;
        static int nr_seen;
        smp_rmb();
        seen = prgrs;
        if (seen > last_prgrs)
            nr_seen++;
        last_prgrs = seen;
    }
Assembly language code:
    do_progress:
        ...
        addl $1, prgrs(%rip)
        ...
        sfence
        ret
    see_progress:
        ...
        lfence
        ...
        movl prgrs(%rip), %eax
        ...
        jle .L193
        addl $1, nr_seen.5542(%rip)
    .L193:
        movl %eax, last_prgrs.5540(%rip)
Memory Ordering of X86
Neither Loads Nor Stores Are Reordered with Like Operations
    CPU 0          CPU 1
    STORE 1 X      R1 = LOAD Y
    STORE 1 Y      R2 = LOAD X
R1 == 1 && R2 == 0 is impossible
Stores Are Not Reordered With Earlier Loads
    CPU 0          CPU 1
    R1 = LOAD X    R2 = LOAD Y
    STORE 1 Y      STORE 1 X
R1 == 1 && R2 == 1 is impossible
Loads May Be Reordered with Earlier Stores to Different Locations
    CPU 0          CPU 1
    STORE 1 X      STORE 1 Y
    R1 = LOAD Y    R2 = LOAD X
R1 == 0 && R2 == 0 is possible
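Why this outcome counts as a reordering can be checked by brute force. The sketch below enumerates every interleaving in which each CPU runs its two operations in program order, with every store immediately visible (that is, sequential consistency), and counts how many of them end with both registers at 0. The function name `count_both_zero_under_sc` is hypothetical, and the model covers only this one litmus test.

```c
/*
 * Counts the sequentially consistent interleavings of
 *   CPU 0: X = 1; r1 = Y;      CPU 1: Y = 1; r2 = X;
 * that end with r1 == 0 && r2 == 0.  The number of interleavings tried is
 * stored in *nr_tried.
 */
int count_both_zero_under_sc(int *nr_tried)
{
    int mask, slot, hits = 0;

    *nr_tried = 0;
    /* bit k of mask set: CPU 0 runs in slot k; clear: CPU 1 runs */
    for (mask = 0; mask < 16; mask++) {
        int x = 0, y = 0, r1 = -1, r2 = -1;
        int done0 = 0, done1 = 0;   /* operations already run per CPU */
        int bits = (mask & 1) + ((mask >> 1) & 1) +
                   ((mask >> 2) & 1) + ((mask >> 3) & 1);

        if (bits != 2)      /* each CPU must get exactly two slots */
            continue;
        for (slot = 0; slot < 4; slot++) {
            if (mask & (1 << slot)) {
                if (done0++ == 0)
                    x = 1;      /* CPU 0: STORE 1 X */
                else
                    r1 = y;     /* CPU 0: R1 = LOAD Y */
            } else {
                if (done1++ == 0)
                    y = 1;      /* CPU 1: STORE 1 Y */
                else
                    r2 = x;     /* CPU 1: R2 = LOAD X */
            }
        }
        (*nr_tried)++;
        if (r1 == 0 && r2 == 0)
            hits++;
    }
    return hits;
}
```

None of the six program-order interleavings produces R1 == 0 && R2 == 0, so when x86 shows that result, at least one load must have taken effect before the earlier store of the same CPU became visible.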
Intra-Processor Forwarding Is Allowed
    CPU 0          CPU 1
    STORE 1 X      STORE 1 Y
    R1 = LOAD X    R3 = LOAD Y
    R2 = LOAD Y    R4 = LOAD X
R2 == 0 && R4 == 0 is possible
Stores Are Transitively Visible
    CPU 0          CPU 1          CPU 2
    STORE 1 X      R1 = LOAD X    R2 = LOAD Y
                   STORE 1 Y      R3 = LOAD X
R1 == 1 && R2 == 1 && R3 == 0 is impossible
Stores Are Seen in a Consistent Order by Others
    CPU 0          CPU 1          CPU 2          CPU 3
    STORE 1 X      STORE 1 Y      R1 = LOAD X    R3 = LOAD Y
                                  R2 = LOAD Y    R4 = LOAD X
R1 == 1 && R2 == 0 && R3 == 1 && R4 == 0 is impossible
X86 Memory Ordering Summary
● A LOAD after a LOAD is never reordered
● A STORE after a STORE is never reordered
● A STORE after a LOAD is never reordered
● STOREs are transitively visible
● STOREs are seen in a consistent order by others
● Intra-processor STORE forwarding is possible
● A LOAD from a different location after a STORE may be reordered
● In short, the ordering is reasonably strong
● For more details, refer to the `Intel Architecture Software Developer’s Manual`
Summary
● The nature of parallel land is counter-intuitive
○ The order of events cannot be defined without interaction
○ Ordering rules differ between environments
○ A memory model defines the ordering rules of its environment
○ In short, they’re all mad here
● For human-intuitive and correct programs, interaction is necessary
○ Every memory model provides synchronization primitives such as atomic instructions and memory
barriers
○ Such interaction is generally expensive
● The Linux kernel memory model is based on the weakest memory model, Alpha
○ Kernel programmers should assume Alpha when writing architecture-independent code
○ Because synchronization primitives are expensive, programmers should use only the
necessary primitives, and only where necessary
Thanks
This work by SeongJae Park is licensed under the
Creative Commons Attribution-ShareAlike 3.0 Unported
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
These Slides Have Been Presented At
● SW Maestro 100+ Conference 2016 (http://onoffmix.com/event/57240)
● KOSSCON 2016 (https://kosscon.kr/2016)

Understanding of linux kernel memory model

  • 1.
    Understanding of Linux KernelMemory Model SeongJae Park <sj38.park@gmail.com>
  • 2.
    Great To MeetYou ● SeongJae Park <sj38.park@gmail.com> ● Started contribution to Linux kernel just for fun since 2012 ● Developing Guaranteed Contiguous Memory Allocator ○ Source code is available: https://lwn.net/Articles/634486/ ● Maintaining Korean translation of Linux kernel memory barrier document ○ The translation has merged into mainline since v4.9-rc1 ○ https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/ko_KR/memory-barriers.txt?h=v4.9-rc1 https://www.linux.com/sites/lcom/files/styles/rendered_file/public/kernel-dev-2015.jpg?itok=9s3mV2Nc
  • 3.
    Programmers in Multi-coreLand ● Processor vendors changed their mind to increase number of cores instead of clock speed a decade ago ○ Now, multi-core system is prevalent ○ Even octa-core portable bomb in your pocket, maybe? http://www.gotw.ca/images/CPU.png GIVE UP :’(
  • 4.
    Programmers in Multi-coreLand ● Processor vendors changed their mind to increase number of cores instead of clock speed a decade ago ○ Now, multi-core system is prevalent ○ Even octa-core portable bomb in your pocket, maybe? ● As a result, the free lunch is over; parallel programming is essential for high performance and scalability http://www.gotw.ca/images/CPU.png GIVE UP :’(
  • 5.
    Writing Correct ParallelProgram is Hard ● Compilers and processors are optimized for Instructions Per Cycle, not programmer perspective goals such as response time or throughput of meaningful (in people’s context) progress
  • 6.
    Writing Correct ParallelProgram is Hard ● Compilers and processors are optimized for Instructions Per Cycle, not programmer perspective goals such as response time or throughput of meaningful (in people’s context) progress ● Nature of parallelism is counter-intuitive ○ Time is relative, before and after is ambiguous, even simultaneous available CPU 0 CPU 1 A = 1; B = 1; A = 2; B = 2; assert(B == 2 && A == 1) CPU 1 assertion can be true on most parallel programming environments
  • 7.
    Writing Correct ParallelProgram is Hard ● Compilers and processors are optimized for Instructions Per Cycle, not programmer perspective goals such as response time or throughput of meaningful (in people’s context) progress ● Nature of parallelism is counter-intuitive ○ Time is relative, before and after is ambiguous, even simultaneous available ● C language developed with Uni-Processor assumption ■ “Et tu, C?” CPU 0 CPU 1 A = 1; B = 1; A = 2; B = 2; assert(B == 2 && A == 1) CPU 1 assertion can be true on most parallel programming environments
  • 8.
    TL; DR ● Memoryoperations can be reordered, merged, or discarded in any way unless it violates memory model defined behavior ○ In short, ``We’re all mad here’’ in parallel land ● Knowing memory model is important to write correct, fast, scalable parallel program https://ih1.redbubble.net/image.218483193.6460/sticker,220x200-pad,220x200,ffffff.u2.jpg
  • 9.
    Reordering for BetterIPC[*] [*] IPC: Instructions Per Cycle
  • 10.
    Simple Program ExecutionSequence ● Programmer writes program in C-like human readable language #include <stdio.h> int main(void) { printf("hello worldn"); return 0; }
  • 11.
    Simple Program ExecutionSequence ● Programmer writes program in C-like human readable language ● Compiler translates human readable code into assembly language #include <stdio.h> int main(void) { printf("hello worldn"); return 0; } main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp Compiler
  • 12.
    Simple Program ExecutionSequence ● Programmer writes program in C-like human readable language ● Compiler translates human readable code into assembly language ● Assembler generates executable binary from the assembly code #include <stdio.h> int main(void) { printf("hello worldn"); return 0; } main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp 00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............ 00000010: 0200 3e00 0100 0000 3004 4000 0000 0000 ..>.....0.@..... 00000020: 4000 0000 0000 0000 d819 0000 0000 0000 @............... 00000030: 0000 0000 4000 3800 0900 4000 AssemblerCompiler
  • 13.
    Simple Program ExecutionSequence ● Programmer writes program in C-like human readable language ● Compiler translates human readable code into assembly language ● Assembler generates executable binary from the assembly code ● Processor executes instruction sequence in the binary #include <stdio.h> int main(void) { printf("hello worldn"); return 0; } main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp 00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............ 00000010: 0200 3e00 0100 0000 3004 4000 0000 0000 ..>.....0.@..... 00000020: 4000 0000 0000 0000 d819 0000 0000 0000 @............... 00000030: 0000 0000 4000 3800 0900 4000 AssemblerCompiler
  • 14.
    Simple Program ExecutionSequence ● Programmer writes program in C-like human readable language ● Compiler translates human readable code into assembly language ● Assembler generates executable binary from the assembly code ● Processor executes instruction sequence in the binary ○ Execution result is guaranteed to be same with sequential execution; In other words, the execution itself is not guaranteed to be sequential #include <stdio.h> int main(void) { printf("hello worldn"); return 0; } main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp 00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............ 00000010: 0200 3e00 0100 0000 3004 4000 0000 0000 ..>.....0.@..... 00000020: 4000 0000 0000 0000 d819 0000 0000 0000 @............... 00000030: 0000 0000 4000 3800 0900 4000 AssemblerCompiler
  • 15.
    Instruction Level Parallelism (ILP)
    ● Pipelining introduces instruction-level parallelism
      ○ Each instruction is split into a sequence of steps; because the steps of different instructions can run in parallel, several instructions are processed concurrently

        Cycle           1       2       3       4       5
        Instruction 1   fetch   decode  execute
        Instruction 2           fetch   decode  execute
        Instruction 3                   fetch   decode  execute

    If not pipelined: 3 cycles per instruction
    A 3-stage pipeline can retire 3 instructions in 5 cycles: about 1.7 cycles per instruction
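The cycle counts on the slide above can be reproduced with a small calculation. This is only a sketch of an idealized pipeline (one stage per cycle, no stalls); the function names are made up for illustration and do not come from the slides.

```c
#include <assert.h>

/* Without pipelining, every instruction occupies all stages by itself. */
unsigned unpipelined_cycles(unsigned stages, unsigned instructions)
{
    return stages * instructions;
}

/* Ideal pipeline: fill it once, then one instruction retires per cycle. */
unsigned pipelined_cycles(unsigned stages, unsigned instructions)
{
    if (instructions == 0)
        return 0;
    return stages + instructions - 1;
}

/* Checks the 3-stage, 3-instruction case shown on the slide. */
void check_slide_numbers(void)
{
    assert(unpipelined_cycles(3, 3) == 9);  /* 3 cycles per instruction */
    assert(pipelined_cycles(3, 3) == 5);    /* about 1.7 cycles per instruction */
}
```

Calling `check_slide_numbers()` from a test program confirms the 9-versus-5 comparison; real processors deviate from this model as soon as instructions depend on each other.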
  • 17.
    Dependent Instructions Harm ILP
    ● If an instruction depends on the result of a previous instruction, it must wait until the previous one finishes execution
      ○ E.g., a = b + c; d = a + b;

        Cycle           1       2       3       4       5       6       7
        Instruction 1   fetch   decode  execute
        Instruction 2                           fetch   decode  execute
        Instruction 3                                   fetch   decode  execute

    In this case, instruction 2 depends on the result of instruction 1 (e.g., the first instruction modifies the opcode of the next instruction)
    7 cycles for 3 instructions: about 2.3 cycles per instruction
  • 19.
    Instruction Reordering Helps Performance
    ● By reordering dependent instructions so that they are located far apart, total execution time can be shortened
    ● If the reordering is guaranteed not to change the result of the instruction sequence, it helps performance

        Cycle           1       2       3       4       5       6
        Instruction 1   fetch   decode  execute
        Instruction 3           fetch   decode  execute
        Instruction 2                           fetch   decode  execute

    Instruction 2 depends on the result of instruction 1 (e.g., the first instruction modifies the opcode of the next instruction)
    By swapping instructions 2 and 3, total execution time is shortened
    6 cycles for 3 instructions: 2 cycles per instruction
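The 7-cycle and 6-cycle figures can be checked with a toy model. The sketch below assumes a 3-stage pipeline that starts at most one instruction per cycle and stalls a dependent instruction until its producer has finished executing; `total_cycles` and its `dep` array are hypothetical names used only for this illustration.

```c
#include <assert.h>

#define STAGES 3        /* fetch, decode, execute */
#define MAX_INSNS 16    /* this toy model ignores anything beyond 16 instructions */

/*
 * dep[i] is the position (in issue order) of the instruction that
 * instruction i must wait for, or -1 if it is independent.
 * Returns the cycle in which the last instruction finishes.
 */
int total_cycles(const int *dep, int n)
{
    int finish[MAX_INSNS];
    int prev_start = 0;
    int last = 0;

    for (int i = 0; i < n && i < MAX_INSNS; i++) {
        int start = prev_start + 1;             /* one new fetch per cycle */

        /* Stall until the producing instruction has finished executing. */
        if (dep[i] >= 0 && dep[i] < i && finish[dep[i]] + 1 > start)
            start = finish[dep[i]] + 1;

        finish[i] = start + STAGES - 1;
        prev_start = start;
        if (finish[i] > last)
            last = finish[i];
    }
    return last;
}

/* Compares program order (1, 2, 3) with the reordered sequence (1, 3, 2). */
void check_reordering_gain(void)
{
    const int program_order[] = { -1, 0, -1 };  /* 2nd waits for 1st */
    const int reordered[]     = { -1, -1, 0 };  /* 3rd slot waits for 1st */

    assert(total_cycles(program_order, 3) == 7);
    assert(total_cycles(reordered, 3) == 6);
}
```

The model is deliberately crude: it has no forwarding paths and no out-of-order issue logic, so it only illustrates why moving an independent instruction between a producer and its consumer hides part of the stall.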
  • 21.
    Reordering is Legal, Encouraged Behavior, But...
    ● As long as program causality is preserved, any reordering is legal
    ● Processors and compilers may reorder instructions for better IPC
    ● Program causality is defined for a single-processor environment
    ● IPC-focused reordering is not aware of programmer-perspective performance goals such as throughput or latency
    ● On a multi-processor system, reordering can harm not only correctness but also performance
    https://s-media-cache-ak0.pinimg.com/originals/c2/1a/00/c21a007e0542f7da57dacd15a86d478d.jpg
    Throughput or latency? Instructions Per Cycle
  • 23.
  • 24.
    Time is Relative (E = mc²)
    ● Each CPU generates its events in its own time and observes the effects of other events in relative time
    ● It is impossible to define an absolute order of two concurrent events; only a relative observation order is possible
    Diagram: CPU 1 to CPU 4 are connected by an event bus. Two CPUs generate event 1 and event 2; one observer reports "event 1 followed by event 2" while another reports "event 2 followed by event 1".
  • 25.
    Relative Event Propagation of Hierarchical Memory
    ● Most systems are equipped with hierarchical memory for better performance and capacity
    ● The propagation speed of an event to a given core can be influenced by a specific sub-layer of the memory hierarchy
    Diagram: CPU 0 and CPU 1 share one cache, CPU 2 and CPU 3 share another; each CPU has its own message queue; the caches and memory are connected by a bus.

    If the CPU 0 message queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event from CPU 1 (event B), even though CPU 1 observed event A before generating event B:
        1. CPU 0: Generate Event A;
        2. CPU 1: Seen Event A; Generate Event B;
        3. CPU 2: Seen Event B; (event A is still delayed: Busy…)
        4. CPU 2: Seen Event A;
  • 30.
    Cache Coherency is Eventual
    ● It is well known that cache coherency protocols help system memory consistency
    ● In fact, cache coherency guarantees only eventual consistency
    ● Every effect of each CPU will eventually become visible on all CPUs, but there is no guarantee that the effects become apparent in the same order on those other CPUs
    http://img06.deviantart.net/0663/i/2005/112/b/6/schrodinger__s_cat___2_by_firefoxcentral.jpg
  • 31.
    System with Store Buffer and Invalidation Queue
    ● Store buffers and invalidation queues deliver the effects of events but do not guarantee the order in which each CPU observes them
    Diagram: CPU 0 and CPU 1 each have a cache, a store buffer, and an invalidation queue; both are connected to memory through a bus.
  • 32.
  • 33.
    C-language Doesn’t Know Multi-Processor
    ● When the C language was first developed, multi-processor systems were rare
    ● As a result, the C language gives only a few guarantees about memory operations on multi-processor systems
    ● For cases the standard leaves undefined, undefined behavior is allowed
    https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/The_C_Programming_Language,_First_Edition_Cover_(2).svg/2000px-The_C_Programming_Language,_First_Edition_Cover_(2).svg.png
  • 34.
    Compiler Optimizes Code
    ● Clever compilers try hard (really hard) to optimize code for high IPC (again, not for programmer-perspective goals)
      ○ Convert small, private functions to inline code
      ○ Reorder memory access code to minimize dependencies
      ○ Simplify unnecessarily complex loops, ...
    ● Optimizers exploit `Undefined behavior` as they see fit
      ○ It’s legal, but sometimes does insane things from the programmer’s perspective
    ● Compiler memory access reordering is based on the C standard, which is not aware of multi-processor systems, so it can generate an unintended program
    ● Linux kernel uses compiler directives and the volatile keyword to enforce memory ordering
    ● C11 improves on this considerably, though
  • 35.
  • 36.
    Each Environment Provides Its Own Memory Model
    ● A memory model defines how memory operations are generated, what effects they have, and how those effects are propagated
    ● Each programming environment (Instruction Set Architecture, programming language, operating system, etc.) defines its own memory model
      ○ Most modern language memory models (e.g., Golang, Rust, …) are aware of multi-processor systems
    http://www.sciencemag.org/sites/default/files/styles/article_main_large/public/Memory.jpg?itok=4FmHo7M5
  • 37.
    Each ISA Provides a Specific Memory Model
    ● Some architectures have stricter ordering enforcement rules than others
    ● PA-RISC CPUs are the strictest, Alpha is the weakest
    ● Because the Linux kernel supports multiple architectures, it defines its memory model based on the weakest one, Alpha
    https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2015.01.31a.pdf
  • 38.
    Synchronization Primitives
    ● Though reordering and asynchronous effect propagation are legal, synchronization primitives are necessary to write human-intuitive programs
    ● Most memory models provide synchronization primitives such as atomic instructions, memory barriers, etc.
    https://s-media-cache-ak0.pinimg.com/236x/42/bc/55/42bc55a6d7e5affe2d0dbe9c872a3df9.jpg
  • 39.
    Atomic Operations
    ● An atomic operation is composed of multiple sub-operations
      ○ E.g., compare-and-swap, fetch-and-add, test-and-set
    ● Atomic operations are mutually exclusive
      ○ The intermediate state of an atomic operation cannot be seen by others
      ○ It can be thought of as a small critical section that is protected by a global lock
    ● Almost every hardware platform supports basic atomic operations
    ● In general, atomic operations are expensive
      ○ Misuse of atomic operations can severely degrade performance and scalability
    http://www.scienceclarified.com/photos/atomic-mass-3024.jpg
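As an illustration of the fetch-and-add and compare-and-swap operations named above, here is a minimal user-space sketch using C11 `<stdatomic.h>`. It is not the kernel's `atomic_t` API; `counter_inc_return` and `try_claim` are hypothetical helper names chosen for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int counter;      /* zero-initialized shared counter */

/* Fetch-and-add: the read, the add, and the write happen as one step. */
int counter_inc_return(void)
{
    return atomic_fetch_add(&counter, 1) + 1;
}

/*
 * Compare-and-swap: set *owner to `me` only if it is still 0 (unowned).
 * Returns true when this caller won the slot.
 */
bool try_claim(atomic_int *owner, int me)
{
    int expected = 0;
    return atomic_compare_exchange_strong(owner, &expected, me);
}

/* Single-threaded walk through both operations. */
void atomic_ops_demo(void)
{
    atomic_int owner;

    atomic_init(&owner, 0);
    assert(try_claim(&owner, 7));       /* first claimant succeeds */
    assert(!try_claim(&owner, 8));      /* second one sees owner != 0 */
    assert(atomic_load(&owner) == 7);
}
```

Because no other thread can observe the state between the compare and the swap, at most one of several concurrent `try_claim()` callers can succeed, which is the "small critical section" view from the slide.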
  • 40.
    Memory Barriers
    ● To allow synchronization of memory operations, a memory model provides enforcement primitives, namely memory barriers
    ● In general, a memory barrier guarantees that the effects of memory operations issued before it are propagated to other components (e.g., processors) in the system before the effects of memory operations issued after it
    ● In general, a memory barrier is an expensive operation

        CPU 1 executes:   READ A; WRITE B; <BARRIER>; READ C;
        CPU 2 observes:   READ A, WRITE B, then READ C occurred
        CPU 3 observes:   WRITE B, READ A, then READ C occurred

    READ A and WRITE B can be reordered, but READ C is guaranteed to be ordered after {READ A, WRITE B}
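The barrier rule above is most often used for message passing: write the data, issue a barrier, then set a flag. The following is a user-space sketch with C11 fences standing in for the kernel's write and read barriers; `producer`, `consumer`, `payload`, and `ready` are names invented for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>

static int payload;             /* ordinary data */
static atomic_int ready;        /* flag that publishes the data */

/* Writer: the store to payload must become visible before the flag. */
void producer(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);  /* plays the write-barrier role */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/*
 * Reader: returns 1 and copies the payload to *out if the flag is set,
 * otherwise returns 0 and leaves *out untouched.
 */
int consumer(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    atomic_thread_fence(memory_order_acquire);  /* plays the read-barrier role */
    *out = payload;
    return 1;
}

/* Single-threaded check of the protocol's two states. */
void message_passing_demo(void)
{
    int v = -1;

    assert(consumer(&v) == 0 && v == -1);       /* nothing published yet */
    producer(42);
    assert(consumer(&v) == 1 && v == 42);
}
```

Without the two fences, the compiler or the CPU would be free to make the flag visible before the payload, and a reader on another CPU could see `ready == 1` together with a stale `payload`.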
  • 41.
    Linux Kernel Memory Model
    ● Defined by the weakest architecture, Alpha
      ○ Almost every combination of reordering is possible
    ● Provides a rich set of atomic instructions
      ○ atomic_xchg(), atomic_inc_return(), atomic_dec_return(), ...
    ● Provides CPU-level barriers, compiler-level barriers, and semantic-level barriers
      ○ Compiler barriers: WRITE_ONCE(), READ_ONCE(), barrier(), ...
      ○ CPU barriers: mb(), wmb(), rmb(), smp_mb(), smp_wmb(), smp_rmb(), …
      ○ Semantic barriers: ACQUIRE operations, RELEASE operations, …
      ○ For details, refer to https://www.kernel.org/doc/Documentation/memory-barriers.txt
    ● Because different barriers have different overheads, only the necessary barrier should be used, and only where necessary, for high performance and scalability
  • 42.
  • 43.
    Memory Operation Reordering
    ● Memory operation reordering is totally LEGAL unless it breaks causality
    ● Both the CPU and the compiler can do it, even on a single processor

        CPU 0           CPU 1                   CPU 2
        A = 1;          while (B == 0) {}       Z = C;
        B = 1;          C = 1;                  X = A;
                                                assert(Z == 0 || X == 1)

    If every CPU executes and observes these operations in program order, the assertion holds :)
  • 46.
    Memory Operation Reordering
    ● Memory operation reordering is totally LEGAL unless it breaks causality
    ● Both the CPU and the compiler can do it, even on a single processor

    CPU 0's two stores are reordered:

        CPU 0           CPU 1                   CPU 2
        B = 1;          while (B == 0) {}       Z = C;
        A = 1;          C = 1;                  X = A;
                                                assert(Z == 0 || X == 1)
  • 47.
    Memory Operation Reordering
    ● Memory operation reordering is totally LEGAL unless it breaks causality
    ● Both the CPU and the compiler can do it, even on a single processor

    CPU 2's two loads are reordered as well, so the assertion can fail:

        CPU 0           CPU 1                   CPU 2
        B = 1;          while (B == 0) {}       X = A;
        A = 1;          C = 1;                  Z = C;
                                                assert(Z == 0 || X == 1) ?????
  • 48.
    Memory Operation Reordering
    ● Memory operation reordering is totally LEGAL unless it breaks causality
    ● Both the CPU and the compiler can do it, even on a single processor
    ● A memory barrier makes the operations specified before it appear to have happened before the operations specified after it
    ● On some architectures, even transitivity is not guaranteed
      ○ Transitivity: B happened after A; C happened after B; then C happened after A

        CPU 0           CPU 1                   CPU 2
        A = 1;          while (B == 0) {}       Z = C;
        wmb();          mb();                   rmb();
        B = 1;          C = 1;                  X = A;
                                                assert(Z == 0 || X == 1)
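For comparison, the same three-CPU pattern can be written in user-space C11 with sequentially consistent atomics, which is the default for `atomic_store()` and `atomic_load()`. This is a sketch of a stronger alternative, not the kernel's `wmb()`/`mb()`/`rmb()` placement: sequential consistency gives all threads a single total order, so the transitivity concern does not arise. The function names `cpu0`, `cpu1`, and `cpu2` are illustrative only.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int A, B, C;      /* all start at 0 */

/* CPU 0: publish A, then B. */
void cpu0(void)
{
    atomic_store(&A, 1);
    atomic_store(&B, 1);
}

/* CPU 1: wait until B is visible, then publish C. */
void cpu1(void)
{
    while (atomic_load(&B) == 0)
        ;                       /* spin */
    atomic_store(&C, 1);
}

/* CPU 2: returns nonzero when the slide's assertion holds. */
int cpu2(void)
{
    int z = atomic_load(&C);
    int x = atomic_load(&A);

    return z == 0 || x == 1;
}
```

In a real program the three functions would run on separate threads; calling them one after another from a single thread only exercises the code path. The price of the simpler reasoning is that sequentially consistent accesses are usually the most expensive ordering a platform offers.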
  • 50.
    Transitivity for Scheduler and Workers
    The scheduler and each worker reached consensus about the order
        Scheduler: "What time is it now?"
        Worker A: "Night!"   Worker B: "Night!"   ...   Worker Z: "Night!"
  • 51.
    Transitivity between Scheduler and Worker
    The scheduler and each worker reached consensus about the order
        Scheduler: "Worker Z, all workers agreed that it’s night. Do bedmaking!"
        Worker Z: "Yay!"
  • 52.
    Transitivity between Scheduler and Worker
    The scheduler and each worker reached consensus about the order
    But worker B and worker Z did not reach consensus with each other
        Worker B: "Worker Z, I’m in afternoon! I didn’t tell you it’s night!"
        Worker Z: "!!??"
  • 53.
  • 54.
    Compiler Reordering Avoidance
    ● Compiler can remove the loop entirely

    C code:
        static int the_var;
        void loop(void)
        {
            int i;
            for (i = 0; i < 1000; i++)
                the_var++;
        }

    Assembly language code:
        loop:
        .LFB106:
            .cfi_startproc
            addl $1000, the_var(%rip)
            ret
            .cfi_endproc
        .LFE106:
  • 55.
    Compiler Reordering Avoidance
    ● ACCESS_ONCE() is a compiler memory barrier implementation of the Linux kernel
    ● The store to the_var may not be seen by others
    ● Still, the store to `the_var` is not issued for every iteration

    C code:
        static int the_var;
        void loop(void)
        {
            int i;
            for (i = 0; ACCESS_ONCE(i) < 1000; i++)
                the_var++;
        }

    Assembly language code:
        loop:
            ...
            movl the_var(%rip), %ecx
        .L175:
            ...
            addl $1, %eax
            ...
            cmpl $999, %edx
            jle .L175
            movl %esi, the_var(%rip)
        .L170:
            rep ret
  • 57.
    Compiler Reordering Avoidance
    ● volatile forces the compiler to issue memory operations as the programmer wrote them (note that this does not force a DRAM access)
    ● However, the repeated LOAD may harm performance

    C code:
        static volatile int the_var;
        void loop(void)
        {
            int i;
            for (i = 0; ACCESS_ONCE(i) < 1000; i++)
                the_var++;
        }

    Assembly language code:
        loop:
            ...
        .L174:
            movl the_var(%rip), %edx
            ...
            addl $1, %edx
            movl %edx, the_var(%rip)
            ...
            cmpl $999, %edx
            jle .L174
        .L170:
            rep ret
            .cfi_endproc
  • 58.
    Compiler Reordering Avoidance
    ● A complete compiler memory barrier can help in this case
    ● Accesses memory once per iteration and uses a register for the loop condition check

    C code:
        static int the_var;
        void loop(void)
        {
            int i;
            for (i = 0; i < 1000; i++) {
                the_var++;
                barrier();
            }
        }

    Assembly language code:
        loop:
        .LFB106:
            ...
        .L172:
            addl $1, the_var(%rip)
            subl $1, %eax
            jne .L172
            rep ret
            .cfi_endproc
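To try the barrier() variant outside the kernel tree, the macro can be defined by hand. The sketch below assumes GCC or Clang and uses the empty inline-assembly statement with a "memory" clobber that the kernel uses for these compilers; it constrains the compiler only and emits no CPU barrier instruction.

```c
#include <assert.h>

/* Compiler-only barrier: tells the compiler that memory may have changed. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static int the_var;

/*
 * Because of barrier(), the compiler may not fold the 1000 increments
 * into a single "add 1000": the_var must be up to date in memory at
 * every barrier.
 */
void loop(void)
{
    int i;

    for (i = 0; i < 1000; i++) {
        the_var++;
        barrier();
    }
}

/* The final value is the same as without the barrier; only codegen differs. */
void barrier_demo(void)
{
    the_var = 0;
    loop();
    assert(the_var == 1000);
}
```

Compiling with `-O2 -S` and comparing against the version without `barrier()` should show a per-iteration update of `the_var` instead of one combined add, though the exact instructions depend on the compiler version.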
  • 59.
  • 60.
    Progress perception
    ● The code does issue LOAD and STORE, but…
    ● see_progress() may see no progress, because a change made by one processor propagates to other processors eventually, not immediately

    C code:
        static int prgrs;
        void do_progress(void)
        {
            prgrs++;
        }
        void see_progress(void)
        {
            static int last_prgrs;
            static int seen;
            static int nr_seen;
            seen = prgrs;
            if (seen > last_prgrs)
                nr_seen++;
            last_prgrs = seen;
        }

    Assembly language code:
        do_progress:
            ...
            addl $1, prgrs(%rip)
            ret
            ...
        see_progress:
            ...
            movl prgrs(%rip), %eax
            ...
            jle .L193
            addl $1, nr_seen.5542(%rip)
        .L193:
            movl %eax, last_prgrs.5540(%rip)
            ret
            .cfi_endproc
  • 61.
    Progress perception
    ● A read barrier and a write barrier help the situation

    C code:
        static int prgrs;
        void do_progress(void)
        {
            prgrs++;
            smp_wmb();
        }
        void see_progress(void)
        {
            static int last_prgrs;
            static int seen;
            static int nr_seen;
            smp_rmb();
            seen = prgrs;
            if (seen > last_prgrs)
                nr_seen++;
            last_prgrs = seen;
        }

    Assembly language code:
        do_progress:
            ...
            addl $1, prgrs(%rip)
            ...
            sfence
            ret
        see_progress:
            ...
            lfence
            ...
            movl prgrs(%rip), %eax
            ...
            jle .L193
            addl $1, nr_seen.5542(%rip)
        .L193:
            movl %eax, last_prgrs.5540(%rip)
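A user-space variant of the progress counter can be sketched with C11 atomics, where the ordering is attached to the accesses themselves rather than expressed as separate barriers. This is an assumption-laden analogue of the slide's code, not a translation of `smp_wmb()`/`smp_rmb()`; here `see_progress()` returns the number of times it noticed progress so that the behavior can be checked.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int prgrs;        /* progress counter, starts at 0 */

/* Writer: atomic increment with release ordering. */
void do_progress(void)
{
    atomic_fetch_add_explicit(&prgrs, 1, memory_order_release);
}

/* Reader: returns how many calls so far have observed new progress. */
int see_progress(void)
{
    static int last_prgrs;
    static int nr_seen;
    int seen = atomic_load_explicit(&prgrs, memory_order_acquire);

    if (seen > last_prgrs)
        nr_seen++;
    last_prgrs = seen;
    return nr_seen;
}
```

The atomic increment also fixes a problem the plain `prgrs++` has when several writers run concurrently: increments can no longer be lost. The ordering annotations constrain how these accesses are ordered relative to surrounding memory operations; they do not make the update reach another CPU any sooner.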
  • 62.
  • 63.
    Neither Loads Nor Stores Are Reordered with Likes

        CPU 0           CPU 1
        STORE 1 X       R1 = LOAD Y
        STORE 1 Y       R2 = LOAD X

    R1 == 1 && R2 == 0 impossible
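Why this outcome is excluded when nothing is reordered can be checked by brute force. The sketch below enumerates every interleaving of the two CPUs that keeps each CPU's own operations in program order and reports whether a given (R1, R2) pair can occur; it models in-order execution only, not real x86 hardware, and the function name is hypothetical.

```c
#include <assert.h>

/*
 * CPU 0: STORE 1 X; STORE 1 Y      CPU 1: R1 = LOAD Y; R2 = LOAD X
 * Returns 1 if some in-order interleaving produces the requested values.
 */
int outcome_reachable_in_order(int want_r1, int want_r2)
{
    /* Each 4-bit mask with exactly two set bits is one interleaving:
     * a set bit means "CPU 0 runs its next operation in this slot". */
    for (unsigned mask = 0; mask < 16; mask++) {
        int x = 0, y = 0, r1 = -1, r2 = -1;
        int step0 = 0, step1 = 0;
        int bits = 0;

        for (int s = 0; s < 4; s++)
            bits += (mask >> s) & 1;
        if (bits != 2)
            continue;

        for (int s = 0; s < 4; s++) {
            if ((mask >> s) & 1) {
                if (step0++ == 0)
                    x = 1;          /* STORE 1 X */
                else
                    y = 1;          /* STORE 1 Y */
            } else {
                if (step1++ == 0)
                    r1 = y;         /* R1 = LOAD Y */
                else
                    r2 = x;         /* R2 = LOAD X */
            }
        }
        if (r1 == want_r1 && r2 == want_r2)
            return 1;
    }
    return 0;
}
```

Of the four value pairs, only R1 == 1 && R2 == 0 is unreachable: seeing Y == 1 means the earlier store to X has already happened, so the later load of X must return 1.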
  • 64.
    Stores Are Not Reordered With Earlier Loads

        CPU 0           CPU 1
        R1 = LOAD X     R2 = LOAD Y
        STORE 1 Y       STORE 1 X

    R1 == 1 && R2 == 1 impossible
  • 65.
    Loads May Be Reordered with Earlier Stores to Different Locations

        CPU 0           CPU 1
        STORE 1 X       STORE 1 Y
        R1 = LOAD Y     R2 = LOAD X

    R1 == 0 && R2 == 0 possible
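One common way to picture how both loads can return 0 is a per-CPU store buffer: a store waits in the buffer while a later load to a different location reads memory directly. The single-threaded simulation below is a toy model under that assumption (one buffer entry per CPU, explicit drain step); all names are invented for the illustration.

```c
#include <assert.h>

enum { X, Y, NR_VARS };

static int memory[NR_VARS];     /* shared memory */

/* A CPU with a one-entry store buffer. */
struct cpu {
    int pending;                /* 1 if the buffer holds a store */
    int addr;
    int val;
};

/* The store goes to the local buffer, not to memory. */
void cpu_store(struct cpu *c, int addr, int val)
{
    c->pending = 1;
    c->addr = addr;
    c->val = val;
}

/* A load sees the CPU's own buffered store first, else shared memory. */
int cpu_load(const struct cpu *c, int addr)
{
    if (c->pending && c->addr == addr)
        return c->val;
    return memory[addr];
}

/* The buffered store finally becomes visible to everyone. */
void cpu_drain(struct cpu *c)
{
    if (c->pending) {
        memory[c->addr] = c->val;
        c->pending = 0;
    }
}

/* Runs the slide's test with both stores still buffered at load time. */
void run_store_buffering(int *r1, int *r2)
{
    struct cpu cpu0 = {0}, cpu1 = {0};

    memory[X] = memory[Y] = 0;
    cpu_store(&cpu0, X, 1);
    cpu_store(&cpu1, Y, 1);
    *r1 = cpu_load(&cpu0, Y);   /* CPU 1's store is not in memory yet */
    *r2 = cpu_load(&cpu1, X);   /* CPU 0's store is not in memory yet */
    cpu_drain(&cpu0);
    cpu_drain(&cpu1);
}
```

In this model `run_store_buffering()` yields R1 == 0 and R2 == 0 even though both stores reach memory in the end. The same `cpu_load()` rule, where a CPU reads its own buffered value before others can, is what the intra-processor forwarding case relies on.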
  • 66.
    Intra-Processor Forwarding Is Allowed

        CPU 0           CPU 1
        STORE 1 X       STORE 1 Y
        R1 = LOAD X     R3 = LOAD Y
        R2 = LOAD Y     R4 = LOAD X

    R2 == 0 && R4 == 0 possible
  • 67.
    Stores Are Transitively Visible

        CPU 0           CPU 1           CPU 2
        STORE 1 X       R1 = LOAD X     R2 = LOAD Y
                        STORE 1 Y       R3 = LOAD X

    R1 == 1 && R2 == 1 && R3 == 0 impossible
  • 68.
    Stores Are Seen in a Consistent Order by Others

        CPU 0           CPU 1           CPU 2           CPU 3
        STORE 1 X       STORE 1 Y       R1 = LOAD X     R3 = LOAD Y
                                        R2 = LOAD Y     R4 = LOAD X

    R1 == 1 && R2 == 0 && R3 == 1 && R4 == 0 impossible
  • 69.
    X86 Memory Ordering Summary
    ● A LOAD after a LOAD is never reordered
    ● A STORE after a STORE is never reordered
    ● A STORE after a LOAD is never reordered
    ● STOREs are transitively visible
    ● STOREs are seen in a consistent order by others
    ● Intra-processor STORE forwarding is possible
    ● A LOAD after a STORE to a different location may be reordered
    ● In short, the ordering is reasonably strong
    ● For more detail, refer to the `Intel Architecture Software Developer’s Manual`
  • 70.
    Summary
    ● The nature of Parallel Land is counter-intuitive
      ○ The order of events cannot be defined without interaction
      ○ Ordering rules differ from environment to environment
      ○ A memory model defines the ordering rules of its environment
      ○ In short, they’re all mad here
    ● For a human-intuitive and correct program, interaction is necessary
      ○ Almost every memory model provides synchronization primitives such as atomic instructions and memory barriers
      ○ Such interaction is generally expensive
    ● The Linux kernel memory model is based on the weakest memory model, Alpha
      ○ Kernel programmers should assume Alpha when writing architecture-independent code
      ○ Because synchronization primitives are expensive, programmers should use only the necessary primitives in the necessary locations
  • 71.
  • 72.
    This work by SeongJae Park is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
  • 73.
    These Slides Have Been Presented At
    ● SW Maestro 100+ Conference 2016 (http://onoffmix.com/event/57240)
    ● KOSSCON 2016 (https://kosscon.kr/2016)