
Speeding up SipHash on Apple Silicon

SipHash is a (relatively) fast pseudorandom function (PRF) designed to replace general-purpose cryptographic hash functions in places where they may be deemed too expensive. For example, the widely used HMAC-based one-time password (HOTP) scheme, which is built on the HMAC-SHA1 construction, does not need to be collision resistant; what matters is that an attacker cannot extract the key used to generate an authentication code.

Unlike x86, where SipHash typically outperforms SHA-256 (even when SHA extensions are present), on Apple Silicon SipHash lags behind hardware-accelerated SHA-256. Benchmarking the reference implementation against Apple's corecrypto implementation of SHA-256 on an Apple M1 shows the latter performing at around 2.45 GB/s compared to the former's 1.79 GB/s when compiled with Clang with optimisation enabled (the GCC-compiled equivalent performs slightly worse at 1.65 GB/s).

This made me curious about how much better an optimised version could do, given that reference implementations are generally written for legibility rather than performance. The TL;DR is that I managed to get an implementation that runs at around 2.37 GB/s (within 3.3% of hardware-accelerated SHA-256) for 128 KB of data (roughly what fits in a Firestorm L1 data cache), and the following explains how.

init

SipHash state is initialised by XORing the two 64-bit halves of a 128-bit key $k$ with four 64-bit words that encode the ASCII string "somepseudorandomlygeneratedbytes".

$$ \begin{aligned} v_0 &= k_0 \oplus \mathrm{736f6d6570736575}_{16} \\ v_1 &= k_1 \oplus \mathrm{646f72616e646f6d}_{16} \\ v_2 &= k_0 \oplus \mathrm{6c7967656e657261}_{16} \\ v_3 &= k_1 \oplus \mathrm{7465646279746573}_{16} \end{aligned} $$

For each line above, the compiler generates a sequence like the following (simplified for clarity), which builds a 64-bit constant from 4 moves and XORs it with a key half to produce an entry in $v$.

mov     x4, #0x6575
movk    x4, #0x7073, lsl #16
movk    x4, #0x6d65, lsl #32
movk    x4, #0x736f, lsl #48
eor     x4, x0, x4
Build the constant in x4 and XOR it with the first half of the key in x0 ($k_0$) to produce $v_0$.
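
For context, the C this is compiled from amounts to four XORs against fixed constants; a minimal sketch along the lines of the reference implementation, assuming a little-endian target such as Apple Silicon:

#include <stdint.h>
#include <string.h>

// Initialise the SipHash state from a 128-bit key.
static void siphash_init(uint64_t v[4], const uint8_t key[16]) {
  uint64_t k0, k1;
  memcpy(&k0, key, 8);     // first half of the key (little-endian load)
  memcpy(&k1, key + 8, 8); // second half of the key

  v[0] = UINT64_C(0x736f6d6570736575) ^ k0; // "somepseu"
  v[1] = UINT64_C(0x646f72616e646f6d) ^ k1; // "dorandom"
  v[2] = UINT64_C(0x6c7967656e657261) ^ k0; // "lygenera"
  v[3] = UINT64_C(0x7465646279746573) ^ k1; // "tedbytes"
}
State initialisation in C (sketch).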

On A64 one cannot load an arbitrary 64-bit literal in a single instruction. The Apple Silicon Optimization Guide recommends building the 64-bit value with 4 successive moves because it takes the same number of instructions as a load from memory (4) and has better cache behaviour, since the data is never read from the data cache.

However, there is a better approach in this case. The A64 ISA also defines an ldp instruction that loads a pair of 64-bit values into two registers with the same latency as a single load. Instead of 16 move instructions to construct the constants, the whole job can therefore be done with 2 load-pair instructions (4 instructions once address generation is included) costing approximately 10 cycles.

The processor can also be given a hint to discard the key data after the values have been loaded into registers by using the load pair with non-temporal hint instruction, ldnp. Because caches keep data close to the processor on the assumption that it is likely to be used again, this (potentially) prevents the key data from "polluting" the cache and forcing out useful data.

// Generate an address for the constants.
adrp    x9, Lv@PAGE
add     x9, x9, Lv@PAGEOFF

// Load the constants from memory into registers.
ldnp    x4, x5, [x9]
ldnp    x6, x7, [x9, #16]
Load pairs of 64-bit integers.

Not only that, but the latter also makes better use of instruction-level parallelism: unlike the 4 moves, each of which must wait for the previous one to finish (they all write the same destination register), the loads above can be issued in parallel, so this ends up taking far fewer cycles than the roughly 10 it would take if executed serially.
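
Staying in C, one way to sketch this is to keep the four constants in a small table and pull them in pairwise with ldnp via inline assembly. The table and helper names below are illustrative rather than taken from the final implementation, and a compiler will often merge adjacent loads into ldp on its own anyway.

#include <stdint.h>

// The four "somepseudorandomlygeneratedbytes" constants kept as data so
// they can be fetched with load-pair instructions instead of mov/movk chains.
static const uint64_t kSipInit[4] = {
    UINT64_C(0x736f6d6570736575), UINT64_C(0x646f72616e646f6d),
    UINT64_C(0x6c7967656e657261), UINT64_C(0x7465646279746573),
};

// Load two adjacent 64-bit words with a non-temporal hint (ldnp).
static inline void load_pair_nt(const uint64_t *p, uint64_t *a, uint64_t *b) {
  uint64_t lo, hi;
  __asm__("ldnp %0, %1, [%2]" : "=r"(lo), "=r"(hi) : "r"(p) : "memory");
  *a = lo;
  *b = hi;
}

static void siphash_load_constants(uint64_t v[4]) {
  load_pair_nt(&kSipInit[0], &v[0], &v[1]);
  load_pair_nt(&kSipInit[2], &v[2], &v[3]);
}
Loading the constants pairwise from C (sketch).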

update

Profiling showed that the majority of CPU time is spent in the compression function. True to Amdahl's law, improving this is where most of the gains came from.

Like most cryptographic algorithms, SipHash has limited opportunities for parallelism, so most of the gains here come from exploiting instruction-level parallelism (ILP) and avoiding suboptimal instructions.

SipRound is the core of this algorithm. It uses the following add-rotate-xor (ARX) construction, which exhibits heavy sequential dependencies that hamper performance on modern out-of-order processors.

$$ \begin{aligned} &v_0 \mathrel{+}= v_1 \\ &v_1 \mathrel{\ll}= 13 \\ &v_1 \mathrel{\oplus}= v_0 \\ &v_0 \mathrel{\ll}= 32 \\ &v_2 \mathrel{+}= v_3 \\ &v_3 \mathrel{\ll}= 16 \\ &v_3 \mathrel{\oplus}= v_2 \\ &v_0 \mathrel{+}= v_3 \\ &v_3 \mathrel{\ll}= 21 \\ &v_3 \mathrel{\oplus}= v_0 \\ &v_2 \mathrel{+}= v_1 \\ &v_1 \mathrel{\ll}= 17 \\ &v_1 \mathrel{\oplus}= v_2 \\ &v_2 \mathrel{\ll}= 32 \end{aligned} $$
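
In C, a SipRound is just straight-line code over the four 64-bit state words, essentially as the reference implementation writes it (sketched below):

#include <stdint.h>

// Rotate a 64-bit word left by b bits.
#define ROTL64(x, b) (uint64_t)(((x) << (b)) | ((x) >> (64 - (b))))

// One SipRound over the state words v0..v3.
#define SIPROUND(v0, v1, v2, v3)  \
  do {                            \
    (v0) += (v1);                 \
    (v1) = ROTL64((v1), 13);      \
    (v1) ^= (v0);                 \
    (v0) = ROTL64((v0), 32);      \
    (v2) += (v3);                 \
    (v3) = ROTL64((v3), 16);      \
    (v3) ^= (v2);                 \
    (v0) += (v3);                 \
    (v3) = ROTL64((v3), 21);      \
    (v3) ^= (v0);                 \
    (v2) += (v1);                 \
    (v1) = ROTL64((v1), 17);      \
    (v1) ^= (v2);                 \
    (v2) = ROTL64((v2), 32);      \
  } while (0)
SipRound in C (sketch).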

For the rotate-then-XOR pairs, the compiler emits fused instructions of the following kind, which are slower on Firestorm because they use depipelined operand forms. Avoiding them effectively doubles the throughput of these operations.

// 64 - 13 = 51 turns the left rotate into a right rotate.
eor     x5, x4, x5, ror #51
Shifted-register operand form.

Reordering the round function as follows (with the first half of the round on the left and the second half on the right) also shows how some instructions can be executed in parallel, albeit in a limited way, since the round still has fundamental sequential data dependencies. Within each column, adjacent operations target different registers, which allows the CPU to pipeline them efficiently.

$$ \begin{alignedat}{2} &v_0 \mathrel{+}= v_1 &\qquad &v_2 \mathrel{+}= v_1 \\ &v_2 \mathrel{+}= v_3 &\qquad &v_0 \mathrel{+}= v_3 \\ &v_1 \mathrel{\ll}= 13 &\qquad &v_1 \mathrel{\ll}= 17 \\ &v_3 \mathrel{\ll}= 16 &\qquad &v_3 \mathrel{\ll}= 21 \\ &v_1 \mathrel{\oplus}= v_0 &\qquad &v_1 \mathrel{\oplus}= v_2 \\ &v_3 \mathrel{\oplus}= v_2 &\qquad &v_3 \mathrel{\oplus}= v_0 \\ &v_0 \mathrel{\ll}= 32 &\qquad &v_2 \mathrel{\ll}= 32 \end{alignedat} $$
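
Expressed in C, the same reordering is only a matter of statement order; a sketch follows (the function name is mine, and the compiler is of course still free to schedule these as it sees fit):

#include <stdint.h>

static inline uint64_t rotl64(uint64_t x, unsigned b) {
  return (x << b) | (x >> (64 - b));
}

// SipRound with the statements interleaved so that neighbouring
// operations touch different state words (v0/v2 and v1/v3 alternate).
static inline void sipround_interleaved(uint64_t v[4]) {
  v[0] += v[1];              v[2] += v[3];
  v[1] = rotl64(v[1], 13);   v[3] = rotl64(v[3], 16);
  v[1] ^= v[0];              v[3] ^= v[2];
  v[0] = rotl64(v[0], 32);
  v[2] += v[1];              v[0] += v[3];
  v[1] = rotl64(v[1], 17);   v[3] = rotl64(v[3], 21);
  v[1] ^= v[2];              v[3] ^= v[0];
  v[2] = rotl64(v[2], 32);
}
Interleaved SipRound in C (sketch).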

The observant reader might notice that this pairing may well map onto NEON registers, e.g., taking $[v_0,\ v_2]$ and $[v_1,\ v_3]$, but this is unlikely to yield an improvement: there are no generalised rotate instructions on A64 that operate on NEON registers, so these would have to be emulated. Moving data in and out of these registers could also prove costly given the rest of the algorithm needs to operate on general-purpose registers.

finalise

Finally, to handle the tail, the reference implementation uses a Duff's-device-style fallthrough switch to merge the remaining bytes into a register before processing the final block.

switch (remainder_len) {
case 7:
  m |= ((uint64_t) remainder[6]) << 48;
  [[fallthrough]];
case 6:
  m |= ((uint64_t) remainder[5]) << 40;
  [[fallthrough]];
case 5:
  m |= ((uint64_t) remainder[4]) << 32;
  [[fallthrough]];
case 4:
  m |= ((uint64_t) remainder[3]) << 24;
  [[fallthrough]];
case 3:
  m |= ((uint64_t) remainder[2]) << 16;
  [[fallthrough]];
case 2:
  m |= ((uint64_t) remainder[1]) << 8;
  [[fallthrough]];
case 1:
  m |= ((uint64_t) remainder[0]);
  break;
case 0:
  break;
}
Tail processing.

In the worst case (7 bytes remaining), this requires 7 loads and 14 arithmetic logic unit (ALU) operations, most of which cannot be done in parallel because they update the same destination register.

This can be reduced to 2 unaligned loads (which can execute in parallel), a bit-field insert, and a move; other lengths need even fewer instructions, e.g., a 4-byte tail only requires a single load and a move.

ldr     w12, [x2]          // Load the first 4 bytes of the tail.
ldr     w13, [x2, #3]      // Load the last 4 bytes (overlapping the first load).
bfi     x12, x13, #24, #32 // Insert them at bit 24 to form the 7-byte tail.
orr     x10, x10, x12      // Merge into m.
Optimised tail processing (7-byte case).
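
The same trick can be expressed portably in C with two overlapping unaligned loads via memcpy. A sketch for tails of 4 to 7 bytes, assuming a little-endian target (shorter tails can be handled analogously with narrower loads):

#include <stdint.h>
#include <string.h>

// Gather a 4..7 byte tail into the low bits of a 64-bit word using two
// overlapping 32-bit loads; the overlapping bytes are identical, so the
// OR below leaves them unchanged. Little-endian only.
static uint64_t load_tail_4_to_7(const uint8_t *p, size_t len) {
  uint32_t lo, hi;
  memcpy(&lo, p, 4);           // bytes 0..3
  memcpy(&hi, p + len - 4, 4); // bytes len-4..len-1 (overlaps lo)
  return (uint64_t)lo | ((uint64_t)hi << ((len - 4) * 8));
}
Overlapping-load tail handling in C (sketch).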

These optimisations are particularly valuable for applications that hash large volumes of data or use SipHash in performance-critical paths. While this is still not faster than hardware-accelerated SHA-256 (2.37 GB/s vs 2.45 GB/s), it is around 1.32x faster than the reference implementation.

The code is available on GitHub.