Implementing a timing, cache attack resistant sort in C

Question

Disclaimer: I am well aware implementing your own crypto is a very bad idea. This is part of a master thesis, the code will not be used in practice.

As part of a larger cryptographic algorithm, I need to sort an array of constant length (small, 24 to be precise), without leaking any information on the contents of this array. As far as I know (please correct me if these are not sufficient to prevent timing and cache attacks), this means:

The sort should run in the same amount of cycles in terms of the length of the array, regardless of the particular values of the array
The sort should not branch or access memory depending on the particular values of the array

Do any such implementations exist? If not, are there any good resources on this type of programming?

To be honest, I'm even struggling with the easier subproblem, namely finding the smallest value of an array.

double arr[24]; // some input
double min = DBL_MAX;

int i;
for (i = 0; i < 24; ++i) {
    if (arr[i] < min) {
        min = arr[i];
    }
}

Would adding an else with a dummy assignment be sufficient to make it timing-safe? If so, how do I ensure the compiler (GCC in my case) doesn't undo my hard work? Would this be susceptible to cache attacks?

A sorting network plus a constant-time compare-and-swap is what you want. See here (slide 32) for the idea. Since the size of your problem is small, the cost of the network probably won't be prohibitive. — Michael F
– Michael F, Commented Apr 21, 2016 at 10:57
If you are really going that deep, you need to think about what the compiler actually makes from your code - i.e. "think in assembler". Working on "C" level will not be enough, especially with modern, very aggressively optimizing compilers. min = arr[i] most probably will work out to one single register assignment (check this ;) ), with no real externally measurable timing impact. Any else clause would be unnecessary. In case things in an if clause are working out to be more complex, your approach is the right one. — tofro
– tofro, Commented Apr 21, 2016 at 10:58
My intuition tells me that you can't get away with sorting some data and not leak information about its properties through branch timing and cache behavior. Why not inject noise into the system to conceal things instead? Shuffle your array with fisher-yates fed from a CSPRNG, then sort it normally with quicksort. Intuitively this feels sufficient since all timing attacks I've seen this far depend on repeated crypto operations that leak a little bit of information each time. So unless one timing attack can extract the full key used for the shuffle you should be fine. — Art
– Art, Commented Apr 21, 2016 at 11:37
Actually, my intuition is probably wrong. Injecting noise might still be a good idea, but I think you can sort without branching or touching different cache lines. Working on a proof-of-concept now. — Art
– Art, Commented Apr 21, 2016 at 12:01

2501 · Accepted Answer · 2016-04-21 11:12:19Z

2

Use a sorting network, a series of comparisons and swaps.

The swap call must not be dependent on the comparison. It must be implemented in a way to execute the same amount of instructions, regardless of the comparison result.

Like this:

#include <stdbool.h>

void swap( int* a , int* b , bool c )
{
    /* Read both values first, then write them back in sorted order. */
    const int min = c ? *b : *a;
    const int max = c ? *a : *b;
    *a = min;
    *b = max;
}

swap( &array[0] , &array[1] , array[0] > array[1] );

Then find the sorting network and use the swaps. Here is a generator that does that for you: http://pages.ripco.net/~jgamble/nw.html

Example for 4 elements, the numbers are array indices, generated by the above link:

SWAP(0, 1);
SWAP(2, 3);
SWAP(0, 2);
SWAP(1, 3);
SWAP(1, 2);
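
As a rough sketch of how the two pieces fit together, the 4-element network above can be driven by a compare-exchange that selects with a bit mask instead of the ternary operator. The names cswap_u32 and sort4_u32 are made up for this illustration, the element type uint32_t is an assumption, and whether the compiler keeps the code branch-free still has to be checked in the generated assembly.

```c
#include <stdint.h>

/* Hypothetical helper: compare-exchange so that *a <= *b afterwards.
 * The comparison result is taken from the borrow of a 64-bit subtraction,
 * so no relational operator on the secret values is needed. */
static void cswap_u32(uint32_t *a, uint32_t *b)
{
    uint32_t x = *a;
    uint32_t y = *b;
    /* gt is 1 when y < x (the subtraction wraps and sets bit 63), else 0 */
    uint32_t gt = (uint32_t)(((uint64_t)y - (uint64_t)x) >> 63);
    uint32_t mask = (uint32_t)0 - gt;   /* all ones or all zeros */
    uint32_t t = (x ^ y) & mask;        /* x ^ y when swapping, else 0 */
    *a = x ^ t;
    *b = y ^ t;
}

#define SWAP(i, j) cswap_u32(&v[(i)], &v[(j)])

/* Hypothetical wrapper: the fixed 5-comparator network for 4 elements. */
void sort4_u32(uint32_t v[4])
{
    SWAP(0, 1);
    SWAP(2, 3);
    SWAP(0, 2);
    SWAP(1, 3);
    SWAP(1, 2);
}
```

Both stores happen on every call and the sequence of index pairs is fixed, so the memory access pattern does not depend on the data. For 24 elements the same macro can be used with the longer comparator list produced by the generator.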

edited Apr 21, 2016 at 11:12

answered Apr 21, 2016 at 11:06

2501


5 Comments

Art Over a year ago

This will have different branch behavior depending on the data being sorted, so I'm pretty sure it will leak information. Timing attacks are less about instructions executed and more about branches missed and cache lines touched. You touch the same cache lines all the time, so that's good, but you still can't get away from the branches.

bkjvbx Over a year ago

@Art: Could you explain how the branching in 2501's swap could leak information? As far as I can tell the exact same instructions would be executed, except in a different order. 1 branch will be missed for any input, right?

2501 Over a year ago

@Art Branches can be removed using tricks: stackoverflow.com/questions/227383/… but in a less portable way.

Art Over a year ago

@qxwevr I don't know. This is why I don't design my own crypto. But from what I hear from people who do know is that branching is not safe regardless of how many instructions you execute.

MikeMB Over a year ago

@qxwevr: The problem with any branch on a modern microprocessor is that the branching itself can take different amounts of time, depending on whether the branch predictor guessed right or not. I don't know, if this will be a problem here, because I'd expect the ternary statement to be compiled into a conditional move (on x86), so there is no actual branch. Still, I'd rather err on the side of caution, where crypto is involved.

Art · Accepted Answer · 2016-04-21 14:24:02Z

2

This is a very dumb bubble sort that actually works and doesn't branch or change memory access behavior depending on input data. I'm not sure whether this can be plugged into another sorting algorithm, since most of them need their compares separate from the swaps, but it may be possible; I'm working on that now.

#include <stdint.h>

static void
cmp_and_swap(uint32_t *ap, uint32_t *bp)
{
        uint32_t a = *ap;
        uint32_t b = *bp;
        int64_t c = (int64_t)a - (int64_t)b;
        uint32_t sign = ((uint64_t)c >> 63);
        uint32_t min = a * sign + b * (sign ^ 1);
        uint32_t max = b * sign + a * (sign ^ 1);
        *ap = min;
        *bp = max;
}

void
timing_sort(uint32_t *arr, int n)
{
        int i, j;
        for (i = n - 1; i >= 0; i--) {
                for (j = 0; j < i; j++) {
                        cmp_and_swap(&arr[j], &arr[j + 1]);
                }
        }
}

The cmp_and_swap function compiles to (Apple LLVM version 7.3.0 (clang-703.0.29), compiled with -O3):

_cmp_and_swap:
00000001000009e0        pushq   %rbp
00000001000009e1        movq    %rsp, %rbp
00000001000009e4        movl    (%rdi), %r8d
00000001000009e7        movl    (%rsi), %r9d
00000001000009ea        movq    %r8, %rdx
00000001000009ed        subq    %r9, %rdx
00000001000009f0        shrq    $0x3f, %rdx
00000001000009f4        movl    %edx, %r10d
00000001000009f7        negl    %r10d
00000001000009fa        orl     $-0x2, %edx
00000001000009fd        incl    %edx
00000001000009ff        movl    %r9d, %ecx
0000000100000a02        andl    %edx, %ecx
0000000100000a04        andl    %r8d, %edx
0000000100000a07        movl    %r8d, %eax
0000000100000a0a        andl    %r10d, %eax
0000000100000a0d        addl    %eax, %ecx
0000000100000a0f        andl    %r9d, %r10d
0000000100000a12        addl    %r10d, %edx
0000000100000a15        movl    %ecx, (%rdi)
0000000100000a17        movl    %edx, (%rsi)
0000000100000a19        popq    %rbp
0000000100000a1a        retq
0000000100000a1b        nopl    (%rax,%rax)

The only memory accesses are the reads and writes of the array, and there are no branches. The compiler did figure out what the multiplication actually does, which is quite clever, but it didn't use branches for that.

The casts to int64_t are necessary to avoid overflows. I'm pretty sure it can be written cleaner.

As requested, here's a compare function for doubles:

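#include <math.h>

void
cmp_and_swap(double *ap, double *bp)
{
        double a = *ap;
        double b = *bp;
        /* signbit() only guarantees a nonzero result, so normalize to 0 or 1 */
        int sign = !!signbit(a - b);
        /* inf * 0 is NaN, so this select assumes finite inputs */
        double min = a * sign + b * (sign ^ 1);
        double max = b * sign + a * (sign ^ 1);
        *ap = min;
        *bp = max;
}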

Compiled code is branchless and doesn't change memory access pattern depending on input data.
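
One possible alternative, sketched here under stated assumptions, is to avoid floating-point arithmetic altogether and compare the raw bit patterns with integer instructions. This assumes IEEE 754 binary64 doubles with the same byte order as uint64_t and no NaNs in the input. The names double_to_key, ct_lt_u64 and cmp_and_swap_bits are invented for the example.

```c
#include <stdint.h>
#include <string.h>

/* Map a double's bit pattern to a key whose unsigned order matches the
 * numeric order: negative values get all bits flipped, non-negative
 * values get only the sign bit flipped. */
static uint64_t double_to_key(uint64_t bits)
{
    uint64_t neg = (uint64_t)0 - (bits >> 63);  /* all ones if sign bit set */
    return bits ^ (neg | 0x8000000000000000ULL);
}

/* Returns 1 when a < b (unsigned), else 0, using only bit operations. */
static uint64_t ct_lt_u64(uint64_t a, uint64_t b)
{
    return ((~a & b) | (~(a ^ b) & (a - b))) >> 63;
}

/* Hypothetical variant of cmp_and_swap: afterwards *ap <= *bp. */
void cmp_and_swap_bits(double *ap, double *bp)
{
    uint64_t a, b;
    memcpy(&a, ap, sizeof a);
    memcpy(&b, bp, sizeof b);

    uint64_t swap = ct_lt_u64(double_to_key(b), double_to_key(a));
    uint64_t mask = (uint64_t)0 - swap;  /* all ones when *bp < *ap */
    uint64_t t = (a ^ b) & mask;
    a ^= t;
    b ^= t;

    memcpy(ap, &a, sizeof a);
    memcpy(bp, &b, sizeof b);
}
```

Because the values are only moved as integers, infinities survive the swap unchanged. With this key mapping -0.0 sorts before +0.0. As with the versions above, the generated code still needs to be inspected for branches.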

edited Apr 21, 2016 at 14:24

answered Apr 21, 2016 at 13:01

Art


6 Comments

bkjvbx Over a year ago

If I'm not mistaken, this is very similar to 2501's answer, with the "tricks" he linked in his comment to avoid branching implemented, correct? Since bubblesort is essentially a specific type of sorting network. Do you know of a similar comparison without branching for floats and/or doubles?

Art Over a year ago

yes, it's the same idea. Just wanted to dump it in form of code. For floats and doubles it should actually be simpler since extracting the sign bit is easier and we don't need to worry about overflow. I'll update in a sec.

2501 Over a year ago

Great and very portable answer. +1 (Hopefully signbit doesn't introduce any branches.) :-)

Art Over a year ago

@2501 I checked the generated code. It doesn't. Although I don't know about the timing of the instructions. Integer instructions have constant cost, but floating point could definitely behave differently depending on input.

Art Over a year ago

@qxwevr oh. of course, that was stupid of me. You want something that checks if a value is 0 or not 0 without branching. int sign = !!signbit(a - b). !! should be the most efficient way to normalize (set to 0 or 1) a boolean in an int and compilers should generate branchless code for that, if possible. I suspect that if "!!x" doesn't generate branchless code, then "a - b < 0" won't either. Btw. It's always safer to write "a < b" to avoid overflows and such. I only did it in the int example to not do comparisons which can cause branches and be able to extract the sign bit.


tofro · Accepted Answer · 2016-04-21 11:19:31Z

0

A very trivial, constant-time (but also highly inefficient) sort is to

have a src and destination array
for each element in the (sorted) destination array, iterate through the complete source array to find the element that belongs exactly into this position.

No early breaks, (nearly) constant timing, not depending on even partial sortedness of the source.
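
A minimal sketch of this idea, assuming uint32_t elements: each source element's final position (its rank) is counted with branchless comparisons, and the element is then offered to every destination slot through a mask, so no store address depends on the data. The function names are hypothetical, and ties are broken by source index so that duplicates get distinct ranks.

```c
#include <stddef.h>
#include <stdint.h>

/* 1 when a < b, else 0 (borrow of a 64-bit subtraction) */
static uint32_t ct_lt_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a - (uint64_t)b) >> 63);
}

/* 1 when a == b, else 0; (d | -d) has its top bit set exactly when d != 0 */
static uint32_t ct_eq_u32(uint32_t a, uint32_t b)
{
    uint32_t d = a ^ b;
    return 1u ^ ((d | ((uint32_t)0 - d)) >> 31);
}

/* Hypothetical rank sort: src and dst must not overlap. */
void rank_sort_u32(const uint32_t *src, uint32_t *dst, size_t n)
{
    size_t i, j, k;

    for (k = 0; k < n; k++)
        dst[k] = 0;

    for (i = 0; i < n; i++) {
        /* rank = number of elements that must come before src[i] */
        uint32_t rank = 0;
        for (j = 0; j < n; j++) {
            rank += ct_lt_u32(src[j], src[i]);
            /* j < i compares public indices only, not secret values */
            rank += ct_eq_u32(src[j], src[i]) & (uint32_t)(j < i);
        }
        /* touch every slot; only the one matching rank receives the value */
        for (k = 0; k < n; k++) {
            uint32_t mask = (uint32_t)0 - ct_eq_u32(rank, (uint32_t)k);
            dst[k] |= src[i] & mask;
        }
    }
}
```

The work is quadratic in n for any input, which is acceptable for 24 elements but slower than a sorting network of the same size.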

answered Apr 21, 2016 at 11:19

tofro
