Provide clear_internal_memory() callback, document the grave performance impact of not using the callbacks #231

Here are some benchmarks of Argon2d 1.3 at t=1, p=1, as implemented in phc-winner-argon2 as of today, for the password hashing use case (not KDF). The machine is 2x E5-2650v4 (24 cores, 48 threads) with 8x DDR4-2400 ([a server kindly provided by Packet](https://www.packet.net/bare-metal/servers/type-2-virtualization/)) running CentOS 7. The code was built with that distro's default gcc and uses AVX2 (I verified with `objdump -d`). The figures are cumulative hashes per second after warm-up (CPUs at max turbo for all cores in use):

```
32 MiB, 48 threads, as-is: 520
32 MiB, 48 threads, malloc()/free() out of loop: 650
32 MiB, 48 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 1070
```

Per `/proc/meminfo`, the memory was allocated as `AnonHugePages`, so I don't expect further speedup from explicit use of huge pages on this Linux kernel (but there might be on CentOS 6, which per my testing wouldn't allocate huge pages transparently). 24 threads were consistently slightly slower across all of these, e.g. 980 hashes/s for the last one.
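For reference, the "malloc()/free() out of loop" variant can be approximated without patching the library by supplying allocation callbacks that reuse one buffer per thread. This is a minimal sketch, not the code used for the numbers above. It assumes the callback signatures match the allocate/free function pointer types declared in `argon2.h`, and that the library detects allocation failure by checking the returned pointer. The names `reuse_allocate`, `reuse_free` and `reuse_release` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-thread cache: one buffer, reused across hash calls.
 * _Thread_local requires C11. */
static _Thread_local uint8_t *cached_mem = NULL;
static _Thread_local size_t cached_size = 0;

/* Assumed to match the allocate callback signature in argon2.h. */
int reuse_allocate(uint8_t **memory, size_t bytes_to_allocate)
{
    assert(memory != NULL);
    if (cached_size < bytes_to_allocate) {
        /* Grow only when needed; old contents are not worth preserving. */
        free(cached_mem);
        cached_mem = malloc(bytes_to_allocate);
        cached_size = cached_mem ? bytes_to_allocate : 0;
    }
    /* On failure this stores NULL, which the caller is assumed to check. */
    *memory = cached_mem;
    return cached_mem ? 0 : -1;
}

/* Assumed to match the free callback signature in argon2.h.
 * Deliberately a no-op so the buffer survives until the next hash. */
void reuse_free(uint8_t *memory, size_t bytes_to_allocate)
{
    (void)memory;
    (void)bytes_to_allocate;
}

/* Call once per thread at shutdown to actually release the buffer. */
void reuse_release(void)
{
    free(cached_mem);
    cached_mem = NULL;
    cached_size = 0;
}
```

An application would point the context's allocate and free callbacks at `reuse_allocate` and `reuse_free`. Note that the buffer then keeps its previous contents between hashes unless the library (or the application) wipes it.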

Here are some more, for 2 MiB (high request rate capacity, targeting L3 cache):

```
2 MiB, 48 threads, as-is: 15.5k
2 MiB, 48 threads, malloc()/free() out of loop: 15.7k
2 MiB, 48 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 27.4k
```

```
2 MiB, 24 threads, as-is: 20.7k
2 MiB, 24 threads, malloc()/free() out of loop: 20.5k
2 MiB, 24 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 23.3k
```

This shows that memory (de)allocation and cleansing have a combined performance cost of over 2x for 32 MiB (520 vs. 1070 hashes/s), at least on this platform. At 2 MiB, glibc is apparently being smart about allocations, so only the cleansing costs us at that size.
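The ratios behind that statement can be recomputed from the figures above. The sketch below only restates the posted numbers (with "15.5k" read as 15500, and so on); the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One row per benchmark configuration, in hashes per second. */
struct result {
    const char *config;
    double as_is;           /* default build */
    double hoisted;         /* malloc()/free() out of loop */
    double hoisted_no_wipe; /* ... and secure_wipe_memory() disabled */
};

const struct result results[] = {
    { "32 MiB, 48 threads",   520.0,   650.0,  1070.0 },
    { "2 MiB, 48 threads",  15500.0, 15700.0, 27400.0 },
    { "2 MiB, 24 threads",  20700.0, 20500.0, 23300.0 },
};

/* Speedup of a tuned build over the baseline. */
double speedup(double baseline_hps, double tuned_hps)
{
    assert(baseline_hps > 0.0);
    return tuned_hps / baseline_hps;
}

/* Print both speedups for every configuration. */
void print_speedups(void)
{
    size_t i;
    for (i = 0; i < sizeof(results) / sizeof(results[0]); i++) {
        printf("%s: %.2fx without malloc()/free(), %.2fx also without wiping\n",
               results[i].config,
               speedup(results[i].as_is, results[i].hoisted),
               speedup(results[i].as_is, results[i].hoisted_no_wipe));
    }
}
```

For 32 MiB at 48 threads this gives 650/520 = 1.25x from hoisting the allocation alone and 1070/520, or about 2.06x, with wiping also disabled.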

The inclusion of such overhead in default builds and its potential extent should be documented prominently (perhaps including right in the `argon2` program output, not only in some documentation file), and a way for applications to override not only the memory (de)allocation functions, but also `clear_internal_memory()` should be provided.
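To make the request concrete, here is one possible shape for such an override. Everything in this sketch is hypothetical: `wipe_fptr`, `wipe_cbk`, `sketch_context` and `sketch_clear_internal_memory()` do not exist in phc-winner-argon2 and only illustrate a wipe callback that falls back to a default when unset, the same way the existing allocation callbacks do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type; not part of the current argon2.h. */
typedef void (*wipe_fptr)(void *mem, size_t len);

/* Stand-in for a secure wipe: volatile stores the compiler must keep. */
static void default_wipe(void *mem, size_t len)
{
    volatile unsigned char *p = mem;
    while (len--)
        *p++ = 0;
}

/* What an application would supply when it reuses and protects the
 * memory itself and does not want to pay for wiping on every hash. */
void noop_wipe(void *mem, size_t len)
{
    (void)mem;
    (void)len;
}

/* Hypothetical context holding only the field of interest here. */
typedef struct {
    wipe_fptr wipe_cbk; /* NULL selects the default wipe */
} sketch_context;

/* Single choke point: every internal wipe would go through here. */
void sketch_clear_internal_memory(const sketch_context *ctx,
                                  void *mem, size_t len)
{
    wipe_fptr wipe;
    assert(ctx != NULL);
    wipe = ctx->wipe_cbk ? ctx->wipe_cbk : default_wipe;
    if (mem != NULL)
        wipe(mem, len);
}
```

With a design like this, default builds stay safe (wiping on), while a password hashing server that manages its own buffers can opt out explicitly rather than by patching the source.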

BTW, while the code mostly uses the `clear_internal_memory()` wrapper function, in two places it calls `secure_wipe_memory()` directly. Is that intentional (why?) or a bug?

As a sanity check, current yescrypt t=0 p=1 (built for plain SSE2, as that still works best even on Broadwell-EP with yescrypt's default 128-bit S-box lookups, which I like for them being so numerous) computes 780 hashes/s for 32 MiB, 48 threads at default settings, and 815 with pwxform rounds reduced from 6 to 3. Since it makes another 1/3 of a pass over the memory (on top of the one pass of Argon2's t=1), this would translate to `780*4/3 = 1040` to `815*4/3 = 1087` hashes/s for just the memory filling phase at the same memory access speed, which is similar to the 1070 we see for Argon2d with the artificial slowdowns removed. That's at the same 1 KiB block size that Argon2 uses. Higher memory bandwidth is usable on this system at 2 KiB (`925*4/3 = 1233`) and 4 KiB (`1005*4/3 = 1340`, `1005*32*2*4/3*2^20/10^9 = 90 GB/s`), but these are not an option for Argon2 (block size not configurable).
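The arithmetic in that comparison can be checked with a few lines of C. This sketch only encodes the formulas quoted above; the factor of 2 is taken as-is from the bandwidth formula and is assumed here to count one read plus one write per byte. The function names are made up for illustration.

```c
#include <assert.h>

/* yescrypt t=0 makes 4/3 passes over memory; scale its rate to the
 * equivalent rate of a single memory-filling pass. */
double fill_phase_rate(double yescrypt_hps)
{
    assert(yescrypt_hps >= 0.0);
    return yescrypt_hps * 4.0 / 3.0;
}

/* Memory traffic in GB/s (10^9 bytes) for a given yescrypt hash rate. */
double bandwidth_gb_per_s(double yescrypt_hps, double mib_per_hash)
{
    const double bytes_per_mib = 1048576.0; /* 2^20 */
    const double traffic_factor = 2.0;      /* assumed: read + write */
    assert(mib_per_hash > 0.0);
    return fill_phase_rate(yescrypt_hps) * mib_per_hash * traffic_factor
           * bytes_per_mib / 1e9;
}
```

`fill_phase_rate(780)` and `fill_phase_rate(815)` bracket the 1070 hashes/s measured for Argon2d, and `bandwidth_gb_per_s(1005, 32)` comes out just under 90.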
