Provide clear_internal_memory() callback, document the grave performance impact of not using the callbacks #231

@solardiz

Here are some benchmarks of Argon2d 1.3 (t=1, p=1) as implemented in phc-winner-argon2 as of today, for the password hashing use case (not KDF). The machine is 2x E5-2650v4 (24 cores, 48 threads) with 8x DDR4-2400 (a server kindly provided by Packet), running CentOS 7. The code was built with that distro's default gcc and uses AVX2 (I verified with objdump -d). Results are in cumulative hashes per second after warm-up (CPUs at max turbo for all cores in use):

32 MiB, 48 threads, as-is: 520
32 MiB, 48 threads, malloc()/free() out of loop: 650
32 MiB, 48 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 1070

Per /proc/meminfo, the memory was allocated as AnonHugePages, so I don't expect further speedup from explicit use of huge pages on this Linux kernel (but there might be on CentOS 6, which per my testing wouldn't allocate huge pages transparently). 24 threads were consistently slightly slower across all of these, e.g. 980 hashes/s for the last one.

Here are some more, for 2 MiB (high request rate capacity, targeting L3 cache):

2 MiB, 48 threads, as-is: 15.5k
2 MiB, 48 threads, malloc()/free() out of loop: 15.7k
2 MiB, 48 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 27.4k
2 MiB, 24 threads, as-is: 20.7k
2 MiB, 24 threads, malloc()/free() out of loop: 20.5k
2 MiB, 24 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 23.3k

This shows that memory (de)allocation and cleansing together cost over 2x in throughput, at least on this platform, at 32 MiB (1070 vs. 520 hashes/s). At 2 MiB, glibc is apparently being smart about the allocations, so only the cleansing costs us at that size.
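To make the overhead explicit, here is a small sketch that turns the hashes/s figures quoted above into slowdown factors. The function names `speedup()` and `print_overheads()` are made up for this illustration and are not part of phc-winner-argon2.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of the faster configuration's throughput to the slower one's. */
double speedup(double slow_hps, double fast_hps)
{
    assert(slow_hps > 0.0);
    return fast_hps / slow_hps;
}

/* Print the as-is vs. "no malloc/free in loop, no wipe" ratios,
 * using the hashes/s figures from the benchmark lists above. */
void print_overheads(void)
{
    printf("32 MiB, 48 threads: %.2fx\n", speedup(520.0, 1070.0));
    printf(" 2 MiB, 48 threads: %.2fx\n", speedup(15500.0, 27400.0));
    printf(" 2 MiB, 24 threads: %.2fx\n", speedup(20700.0, 23300.0));
}
```

Calling `print_overheads()` from `main()` prints one ratio per configuration; the 32 MiB case is the one that exceeds 2x.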

The inclusion of such overhead in default builds and its potential extent should be documented prominently (perhaps including right in the argon2 program output, not only in some documentation file), and a way for applications to override not only the memory (de)allocation functions, but also clear_internal_memory() should be provided.
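For the (de)allocation part, a minimal sketch of what "malloc()/free() out of loop" could look like via the existing memory callbacks. The two typedefs restate the callback shapes I believe argon2.h uses for the `allocate_cbk` and `free_cbk` fields of `argon2_context`, so that the sketch compiles on its own; check them against your copy of the header. The `reuse_*` names, the per-thread region, and the 0-on-success return convention are assumptions of this sketch. It uses C11 `_Thread_local`. No comparable hook for clear_internal_memory() is shown, since that is what this issue asks for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Callback shapes restated from argon2.h (assumed, verify against the header). */
typedef int (*allocate_fptr)(uint8_t **memory, size_t bytes_to_allocate);
typedef void (*deallocate_fptr)(uint8_t *memory, size_t bytes_to_allocate);

/* Hypothetical per-thread region, allocated once and reused across hashes. */
static _Thread_local uint8_t *region;
static _Thread_local size_t region_size;

/* Hand out the cached region, growing it only when a larger size is requested. */
int reuse_allocate(uint8_t **memory, size_t bytes)
{
    if (memory == NULL || bytes == 0)
        return -1;
    if (bytes > region_size) {
        uint8_t *p = realloc(region, bytes);
        if (p == NULL) {
            *memory = NULL; /* caller can detect failure via NULL as well */
            return -1;
        }
        region = p;
        region_size = bytes;
    }
    *memory = region;
    return 0;
}

/* Keep the region for the next hash instead of returning it to the allocator.
 * Note that nothing is wiped here; that is a separate decision. */
void reuse_free(uint8_t *memory, size_t bytes)
{
    (void)bytes;
    assert(memory == region);
}

/* Call once per thread at shutdown to actually release the memory. */
void reuse_release(void)
{
    free(region);
    region = NULL;
    region_size = 0;
}
```

An application would set `ctx.allocate_cbk = reuse_allocate` and `ctx.free_cbk = reuse_free` before calling into the library, and call `reuse_release()` when each worker thread exits.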

BTW, while the code mostly uses the clear_internal_memory() wrapper function, in two places it calls secure_wipe_memory() directly - intentional (why?) or a bug?

As a sanity check, current yescrypt t=0 p=1 (built for plain SSE2, as that still works best even on Broadwell-EP with yescrypt's default 128-bit S-box lookups, which I like for them being so numerous) computes 780 hashes/s for 32 MiB, 48 threads at default settings, 815 at pwxform rounds reduced from 6 to 3. Since it has another 1/3 of a pass over the memory (on top of Argon2's t=1), this would translate to 780*4/3 = 1040 to 815*4/3 = 1087 hashes/s for just the memory filling phase at same memory access speed, which is similar to the 1070 we see for Argon2d with the artificial slowdowns removed. That's at the same 1 KiB block size that Argon2 uses. Higher memory bandwidth is usable on this system at 2 KiB (925*4/3 = 1233) and 4 KiB (1005*4/3 = 1340, 1005*32*2*4/3*2^20/10^9 = 90 GB/s), but these are not an option for Argon2 (block size not configurable).
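The arithmetic above can be reproduced with a few lines of C. This is only a sketch of the calculation as written in the paragraph; the function names are invented here, and reading the factor of 2 as "each byte is written once and read once" is my interpretation of the formula.

```c
#include <assert.h>

/* yescrypt at t=0 makes 4/3 passes over memory versus one pass for
 * Argon2 at t=1, so scale its hashes/s by 4/3 to compare filling speed. */
double fill_equivalent_hps(double yescrypt_hps)
{
    assert(yescrypt_hps >= 0.0);
    return yescrypt_hps * 4.0 / 3.0;
}

/* Memory traffic in GB/s (10^9 bytes) for a given yescrypt rate and
 * per-hash memory size in MiB: hps * MiB * 2 * 4/3 * 2^20 / 10^9. */
double bandwidth_gb_per_s(double yescrypt_hps, double mib_per_hash)
{
    return fill_equivalent_hps(yescrypt_hps) * mib_per_hash * 2.0
           * 1048576.0 / 1e9;
}
```

For example, `fill_equivalent_hps(780.0)` gives 1040, and `bandwidth_gb_per_s(1005.0, 32.0)` comes out just under 90, matching the figures in the text.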
