chore: checkpoint before Python removal

2026-03-26 22:33:59 +00:00
parent 683cec9307
commit e568ddf82a
29972 changed files with 11269302 additions and 2 deletions


@@ -0,0 +1,12 @@
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous">
<!-- The loading of KaTeX is deferred to speed up page rendering -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script>
<!-- To automatically render math in text elements, include the auto-render extension: -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous"
onload="renderMathInElement(document.body);"></script>
<style>
.katex { font-size: 1em !important; }
</style>


@@ -0,0 +1,140 @@
An AVX2 implementation of the vectorized point operation strategy.
# Field element representation
Our strategy is to implement 4-wide multiplication and squaring by
wordslicing, using one 64-bit AVX2 lane for each field element. Field
elements are represented in the usual way as 10 `u32` limbs in radix
\\(25.5\\) (i.e., alternating between \\(2\^{26}\\) for even limbs and
\\(2\^{25}\\) for odd limbs). This has the effect that passing between
the parallel 32-bit AVX2 representation and the serial 64-bit
representation (which uses radix \\(2^{51}\\)) amounts to regrouping
digits.
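Concretely, regrouping digits amounts to pairing each even (26-bit) limb with the odd (25-bit) limb above it. A scalar sketch of this conversion (an illustrative helper with an assumed name, not the crate's code):

```rust
// Regroup ten radix-25.5 limbs (alternating 26/25 bits) into five
// radix-2^51 limbs: limb 2i sits at bits [0, 26) of the 51-bit digit,
// limb 2i+1 at bits [26, 51).
fn to_radix_51(a: &[u32; 10]) -> [u64; 5] {
    let mut out = [0u64; 5];
    for i in 0..5 {
        out[i] = (a[2 * i] as u64) + ((a[2 * i + 1] as u64) << 26);
    }
    out
}
```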
The field element representation is oriented around the AVX2
`vpmuludq` instruction, which multiplies the low 32 bits of each
64-bit lane of each operand to produce a 64-bit result.
```text,no_run
(a1 ?? b1 ?? c1 ?? d1 ??)
(a2 ?? b2 ?? c2 ?? d2 ??)
(a1*a2 b1*b2 c1*c2 d1*d2)
```
To unpack 32-bit values into 64-bit lanes for use in multiplication
it would be convenient to use the `vpunpck[lh]dq` instructions,
which unpack and interleave the low and high 32-bit lanes of two
source vectors.
However, the AVX2 versions of these instructions are designed to
operate only within 128-bit lanes of the 256-bit vectors, so that
interleaving the low lanes of `(a0 b0 c0 d0 a1 b1 c1 d1)` with zero
gives `(a0 00 b0 00 a1 00 b1 00)`. Instead, we pre-shuffle the data
layout as `(a0 b0 a1 b1 c0 d0 c1 d1)` so that we can unpack the
"low" and "high" parts as
```text,no_run
(a0 00 b0 00 c0 00 d0 00)
(a1 00 b1 00 c1 00 d1 00)
```
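In scalar terms, the in-lane unpack against zero behaves like the following sketch (an illustrative model of the shuffle, not the vectorized code):

```rust
// Model of vpunpck[lh]dq against zero on the pre-shuffled layout
// (a0 b0 a1 b1 | c0 d0 c1 d1): each 128-bit half interleaves its
// low (resp. high) two 32-bit lanes with zeros.
fn unpack(v: [u32; 8]) -> ([u32; 8], [u32; 8]) {
    let lo = [v[0], 0, v[1], 0, v[4], 0, v[5], 0]; // (a0 00 b0 00 c0 00 d0 00)
    let hi = [v[2], 0, v[3], 0, v[6], 0, v[7], 0]; // (a1 00 b1 00 c1 00 d1 00)
    (lo, hi)
}
```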
The data layout for a vector of four field elements \\( (a,b,c,d)
\\) with limbs \\( a_0, a_1, \ldots, a_9 \\) is as `[u32x8; 5]` in
the form
```text,no_run
(a0 b0 a1 b1 c0 d0 c1 d1)
(a2 b2 a3 b3 c2 d2 c3 d3)
(a4 b4 a5 b5 c4 d4 c5 d5)
(a6 b6 a7 b7 c6 d6 c7 d7)
(a8 b8 a9 b9 c8 d8 c9 d9)
```
Since this breaks cleanly into two 128-bit lanes, it may be possible
to adapt it to 128-bit vector instructions such as NEON without too
much difficulty.
# Avoiding Overflow in Doubling
To analyze the size of the field element coefficients during the
computations, we can parameterize the bounds on the limbs of each
field element by \\( b \in \mathbb R \\) representing the excess bits
above that limb's radix, so that each limb is bounded by either
\\(2\^{25+b} \\) or \\( 2\^{26+b} \\), as appropriate.
The multiplication routine requires that its inputs are bounded with
\\( b < 1.75 \\), in order to fit a multiplication by \\( 19 \\)
into 32 bits. Since \\( \lg 19 < 4.25 \\), \\( 19x < 2\^{32} \\)
when \\( x < 2\^{27.75} = 2\^{26 + 1.75} \\). However, this is only
required for one of the inputs; the other can grow up to \\( b < 2.5
\\).
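The bound can be checked numerically (an illustrative check, not crate code):

```rust
// A limb bounded by 2^(26 + b) still fits in 32 bits after
// multiplication by 19 exactly when lg(19) + 26 + b < 32.
fn fits_after_mul_by_19(bits: f64) -> bool {
    19f64.log2() + bits < 32.0
}
```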
In addition, the multiplication and squaring routines do not
canonically reduce their outputs, but can leave some small uncarried
excesses, so that their reduced outputs are bounded with
\\( b < 0.007 \\).
The non-parallel portion of the doubling formulas is
$$
\begin{aligned}
(S\_5 &&,&& S\_6 &&,&& S\_8 &&,&& S\_9 )
&\gets
(S\_1 + S\_2 &&,&& S\_1 - S\_2 &&,&& S\_1 + 2S\_3 - S\_2 &&,&& S\_1 + S\_2 - S\_4)
\end{aligned}
$$
Computing \\( (S\_5, S\_6, S\_8, S\_9 ) \\) as
$$
\begin{matrix}
& S\_1 & S\_1 & S\_1 & S\_1 \\\\
+& S\_2 & & & S\_2 \\\\
+& & & S\_3 & \\\\
+& & & S\_3 & \\\\
+& & 2p & 2p & 2p \\\\
-& & S\_2 & S\_2 & \\\\
-& & & & S\_4 \\\\
=& S\_5 & S\_6 & S\_8 & S\_9
\end{matrix}
$$
results in bit-excesses \\( < (1.01, 1.60, 2.33, 2.01)\\) for
\\( (S\_5, S\_6, S\_8, S\_9 ) \\). The products we want to compute
are then
$$
\begin{aligned}
X\_3 &\gets S\_8 S\_9 \leftrightarrow (2.33, 2.01) \\\\
Y\_3 &\gets S\_5 S\_6 \leftrightarrow (1.01, 1.60) \\\\
Z\_3 &\gets S\_8 S\_6 \leftrightarrow (2.33, 1.60) \\\\
T\_3 &\gets S\_5 S\_9 \leftrightarrow (1.01, 2.01)
\end{aligned}
$$
which are too large: it's not possible to arrange the multiplicands so
that one vector has \\(b < 2.5\\) and the other has \\( b < 1.75 \\).
However, if we flip the sign of \\( S\_4 = S\_0\^2 \\) during
squaring, so that we output \\(S\_4' = -S\_4 \pmod p\\), then we can
compute
$$
\begin{matrix}
& S\_1 & S\_1 & S\_1 & S\_1 \\\\
+& S\_2 & & & S\_2 \\\\
+& & & S\_3 & \\\\
+& & & S\_3 & \\\\
+& & & & S\_4' \\\\
+& & 2p & 2p & \\\\
-& & S\_2 & S\_2 & \\\\
=& S\_5 & S\_6 & S\_8 & S\_9
\end{matrix}
$$
resulting in bit-excesses \\( < (1.01, 1.60, 2.33, 1.60)\\) for
\\( (S\_5, S\_6, S\_8, S\_9 ) \\). The products we want to compute
are then
$$
\begin{aligned}
X\_3 &\gets S\_8 S\_9 \leftrightarrow (2.33, 1.60) \\\\
Y\_3 &\gets S\_5 S\_6 \leftrightarrow (1.01, 1.60) \\\\
Z\_3 &\gets S\_8 S\_6 \leftrightarrow (2.33, 1.60) \\\\
T\_3 &\gets S\_5 S\_9 \leftrightarrow (1.01, 1.60)
\end{aligned}
$$
whose right-hand sides are all bounded with \\( b < 1.75 \\) and
whose left-hand sides are all bounded with \\( b < 2.5 \\),
so that we can avoid any intermediate reductions.


@@ -0,0 +1,580 @@
An AVX512-IFMA implementation of the vectorized point operation
strategy.
# IFMA instructions
AVX512-IFMA is an extension to AVX-512 consisting of two instructions:
* `vpmadd52luq`: packed multiply of unsigned 52-bit integers and add
the low 52 product bits to 64-bit accumulators;
* `vpmadd52huq`: packed multiply of unsigned 52-bit integers and add
the high 52 product bits to 64-bit accumulators.
These operate on 64-bit lanes of their source vectors, taking the low
52 bits of each lane of each source vector, computing the 104-bit
products of each pair, and then adding either the high or low 52 bits
of the 104-bit products to the 64-bit lanes of the destination vector.
The multiplication is performed internally by reusing circuitry for
floating-point arithmetic. Although these instructions are part of
AVX512, the AVX512VL (vector length) extension (present whenever IFMA
is) allows using them with 512, 256, or 128-bit operands.
This provides a major advantage to vectorized integer operations:
previously, vector operations could only use a \\(32 \times 32
\rightarrow 64\\)-bit multiplier, while serial code could use a
\\(64\times 64 \rightarrow 128\\)-bit multiplier.
## IFMA for big-integer multiplications
A detailed example of the intended use of the IFMA instructions can be
found in a 2016 paper by Gueron and Krasnov, [_Accelerating Big
Integer Arithmetic Using Intel IFMA Extensions_][2016_gueron_krasnov].
The basic idea is that multiplication of large integers (such as 1024,
2048, or more bits) can be performed as follows.
First, convert a “packed” 64-bit representation
\\[
\begin{aligned}
x &= x'_0 + x'_1 2^{64} + x'_2 2^{128} + \cdots \\\\
y &= y'_0 + y'_1 2^{64} + y'_2 2^{128} + \cdots
\end{aligned}
\\]
into a “redundant” 52-bit representation
\\[
\begin{aligned}
x &= x_0 + x_1 2^{52} + x_2 2^{104} + \cdots \\\\
y &= y_0 + y_1 2^{52} + y_2 2^{104} + \cdots
\end{aligned}
\\]
with each \\(x_i, y_j\\) in a 64-bit lane.
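The repacking step can be sketched in scalar code (a two-word example with an assumed name, not the paper's implementation):

```rust
// Convert a little-endian pair of packed 64-bit words into three
// redundant 52-bit limbs, one per (eventual) 64-bit lane.
fn to_radix_52(x: &[u64; 2]) -> [u64; 3] {
    const MASK: u64 = (1 << 52) - 1;
    [
        x[0] & MASK,                          // bits 0..52
        ((x[0] >> 52) | (x[1] << 12)) & MASK, // bits 52..104
        x[1] >> 40,                           // bits 104..128
    ]
}
```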
Writing the product as \\(z = z_0 + z_1 2^{52} + z_2 2^{104} + \cdots\\),
the “schoolbook” multiplication strategy gives
\\[
\begin{aligned}
&z_0 &&=& x_0 & y_0 & & & & & & & & \\\\
&z_1 &&=& x_1 & y_0 &+ x_0 & y_1 & & & & & & \\\\
&z_2 &&=& x_2 & y_0 &+ x_1 & y_1 &+ x_0 & y_2 & & & & \\\\
&z_3 &&=& x_3 & y_0 &+ x_2 & y_1 &+ x_1 & y_2 &+ x_0 & y_3 & & \\\\
&z_4 &&=& \vdots\\;&\\;\vdots &+ x_3 & y_1 &+ x_2 & y_2 &+ x_1 & y_3 &+ \cdots& \\\\
&z_5 &&=& & & \vdots\\;&\\;\vdots &+ x_3 & y_2 &+ x_2 & y_3 &+ \cdots& \\\\
&z_6 &&=& & & & & \vdots\\;&\\;\vdots &+ x_3 & y_3 &+ \cdots& \\\\
&z_7 &&=& & & & & & & \vdots\\;&\\;\vdots &+ \cdots& \\\\
&\vdots&&=& & & & & & & & & \ddots& \\\\
\end{aligned}
\\]
Notice that the product coefficient \\(z_k\\), representing the value
\\(z_k 2^{52k}\\), is the sum of all product terms
\\(
(x_i 2^{52 i}) (y_j 2^{52 j})
\\)
with \\(k = i + j\\).
Write the IFMA operators \\(\mathrm{lo}(a,b)\\), denoting the low
\\(52\\) bits of \\(ab\\), and
\\(\mathrm{hi}(a,b)\\), denoting the high \\(52\\) bits of
\\(ab\\).
Now we can rewrite the product terms as
\\[
\begin{aligned}
(x_i 2^{52 i}) (y_j 2^{52 j})
&=
2^{52 (i+j)}(
\mathrm{lo}(x_i, y_j) +
\mathrm{hi}(x_i, y_j) 2^{52}
)
\\\\
&=
\mathrm{lo}(x_i, y_j) 2^{52 (i+j)} +
\mathrm{hi}(x_i, y_j) 2^{52 (i+j+1)}.
\end{aligned}
\\]
This means that the low half of \\(x_i y_j\\) can be accumulated onto
the product limb \\(z_{i+j}\\) and the high half can be directly
accumulated onto the next-higher product limb \\(z_{i+j+1}\\) with no
additional operations. This allows rewriting the schoolbook
multiplication into the form
\\[
\begin{aligned}
&z_0 &&=& \mathrm{lo}(x_0,&y_0) & & & & & & & & & & \\\\
&z_1 &&=& \mathrm{lo}(x_1,&y_0) &+\mathrm{hi}(x_0,&y_0) &+\mathrm{lo}(x_0,&y_1) & & & & & & \\\\
&z_2 &&=& \mathrm{lo}(x_2,&y_0) &+\mathrm{hi}(x_1,&y_0) &+\mathrm{lo}(x_1,&y_1) &+\mathrm{hi}(x_0,&y_1) &+\mathrm{lo}(x_0,&y_2) & & \\\\
&z_3 &&=& \mathrm{lo}(x_3,&y_0) &+\mathrm{hi}(x_2,&y_0) &+\mathrm{lo}(x_2,&y_1) &+\mathrm{hi}(x_1,&y_1) &+\mathrm{lo}(x_1,&y_2) &+ \cdots& \\\\
&z_4 &&=& \vdots\\;&\\;\vdots &+\mathrm{hi}(x_3,&y_0) &+\mathrm{lo}(x_3,&y_1) &+\mathrm{hi}(x_2,&y_1) &+\mathrm{lo}(x_2,&y_2) &+ \cdots& \\\\
&z_5 &&=& & & \vdots\\;&\\;\vdots & \vdots\\;&\\;\vdots &+\mathrm{hi}(x_3,&y_1) &+\mathrm{lo}(x_3,&y_2) &+ \cdots& \\\\
&z_6 &&=& & & & & & & \vdots\\;&\\;\vdots & \vdots\\;&\\;\vdots &+ \cdots& \\\\
&\vdots&&=& & & & & & & & & & & \ddots& \\\\
\end{aligned}
\\]
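A scalar model of the two instructions and of the accumulation pattern above might look like this (illustrative only; the real instructions operate on vector lanes):

```rust
const MASK52: u64 = (1 << 52) - 1;

// Model of vpmadd52luq on one lane: accumulate the low 52 bits of the
// 104-bit product of the low 52 bits of a and b.
fn madd52lo(z: u64, a: u64, b: u64) -> u64 {
    let p = (a & MASK52) as u128 * (b & MASK52) as u128;
    z + ((p as u64) & MASK52)
}

// Model of vpmadd52huq: accumulate the high 52 bits of the product.
fn madd52hi(z: u64, a: u64, b: u64) -> u64 {
    let p = (a & MASK52) as u128 * (b & MASK52) as u128;
    z + ((p >> 52) as u64)
}

// Schoolbook product of two 2-limb radix-2^52 integers: lo(x_i, y_j)
// accumulates onto z_{i+j}, hi(x_i, y_j) directly onto z_{i+j+1}.
fn mul2(x: [u64; 2], y: [u64; 2]) -> [u64; 4] {
    let mut z = [0u64; 4];
    for i in 0..2 {
        for j in 0..2 {
            z[i + j] = madd52lo(z[i + j], x[i], y[j]);
            z[i + j + 1] = madd52hi(z[i + j + 1], x[i], y[j]);
        }
    }
    z
}
```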
Gueron and Krasnov implement multiplication by constructing vectors
out of the columns of this diagram, so that the source operands for
the IFMA instructions are of the form \\((x_0, x_1, x_2, \ldots)\\)
and \\((y_i, y_i, y_i, \ldots)\\).
After performing the multiplication,
the product terms \\(z_i\\) are then repacked into a 64-bit representation.
## An alternative strategy
The strategy described above is aimed at big-integer multiplications,
such as 1024, 2048, or 4096 bits, which would be used for applications
like RSA. However, elliptic curve cryptography uses much smaller field
sizes, such as 256 or 384 bits, so a different strategy is needed.
The parallel Edwards formulas provide parallelism at the level of the
formulas for curve operations. This means that instead of scanning
through the terms of the source operands and parallelizing *within* a
field element (as described above), we can arrange the computation in
product-scanning form and parallelize *across* field elements (as
described below).
The parallel Edwards
formulas provide 4-way parallelism, so they can be implemented using
256-bit vectors using a single 64-bit lane for each element, or using
512-bit vectors using two 64-bit lanes.
The only available CPU supporting IFMA (the
i3-8121U) executes 512-bit IFMA instructions at half rate compared to
256-bit instructions, so for now there's no throughput advantage to
using 512-bit IFMA instructions, and this implementation uses 256-bit
vectors.
To extend this to 512-bit vectors, it's only necessary to achieve
2-way parallelism, and it's possible (with a small amount of overhead)
to create a hybrid strategy that operates entirely within 128-bit
lanes. This means that cross-lane operations can use the faster
`vpshufd` (1c latency) instead of a general shuffle instruction (3c
latency).
# Choice of radix
The inputs to IFMA instructions are 52 bits wide, so the radix \\(r\\)
used to represent a multiprecision integer must be \\( r \leq 52 \\).
The obvious choice is the "native" radix \\(r = 52\\).
As described above, this choice
has the advantage that for \\(x_i, y_j \in [0,2^{52})\\), the product term
\\[
\begin{aligned}
(x_i 2^{52 i}) (y_j 2^{52 j})
&=
2^{52 (i+j)}(
\mathrm{lo}(x_i, y_j) +
\mathrm{hi}(x_i, y_j) 2^{52}
)
\\\\
&=
\mathrm{lo}(x_i, y_j) 2^{52 (i+j)} +
\mathrm{hi}(x_i, y_j) 2^{52 (i+j+1)},
\end{aligned}
\\]
so that the low and high halves of the product can be directly accumulated
onto the product limbs.
In contrast, when using a smaller radix \\(r = 52 - k\\),
the product term has the form
\\[
\begin{aligned}
(x_i 2^{r i}) (y_j 2^{r j})
&=
2^{r (i+j)}(
\mathrm{lo}(x_i, y_j) +
\mathrm{hi}(x_i, y_j) 2^{52}
)
\\\\
&=
\mathrm{lo}(x_i, y_j) 2^{r (i+j)} +
(
\mathrm{hi}(x_i, y_j) 2^k
)
2^{r (i+j+1)}.
\end{aligned}
\\]
What's happening is that the product \\(x_i y_j\\) of size \\(2r\\)
bits is split not at \\(r\\) but at \\(52\\), so \\(k\\) product bits
are placed into the low half instead of the high half. This means
that the high half of the product cannot be directly accumulated onto
\\(z_{i+j+1}\\), but must first be multiplied by \\(2^k\\) (i.e., left
shifted by \\(k\\)). In addition, the low half of the product is
\\(52\\) bits large instead of \\(r\\) bits.
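For example, with \\(r = 51\\) (so \\(k = 1\\)), the split can be modeled in scalar code (illustrative only):

```rust
const MASK52: u64 = (1 << 52) - 1;

// The product of two radix-2^51 limbs still splits at bit 52, so when
// re-expressed in radix 2^51 the high half carries a factor 2^k = 2.
fn lo_hi_52(x: u64, y: u64) -> (u64, u64) {
    let p = (x & MASK52) as u128 * (y & MASK52) as u128;
    ((p as u64) & MASK52, (p >> 52) as u64)
}
```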
## Handling offset product terms
[Drucker and Gueron][2018_drucker_gueron] analyze the choice of radix
in the context of big-integer squaring, outlining three ways to handle
the offset product terms, before concluding that all of them are
suboptimal:
1. Shift the results after accumulation;
2. Shift the input operands before multiplication;
3. Split the MAC operation, accumulating into a zeroed register,
shifting the result, and then adding.
The first option is rejected because it could double-shift some
previously accumulated terms, the second doesn't work because the
inputs could become larger than \\(52\\) bits, and the third requires
additional instructions to handle the shifting and adding.
Based on an analysis of the total number of instructions, they suggest an
addition to the instruction set, which they call `FMSA` (fused
multiply-shift-add). This would shift the result according to an 8-bit
immediate value before accumulating it into the destination register.
However, this change to the instruction set doesn't seem to be
necessary. Instead, the product terms can be grouped according to
their coefficients, accumulated together, then shifted once before
adding them to the final sum. This uses an extra register, shift, and
add, but only once per product term (accumulation target), not once
per source term (as in the Drucker-Gueron paper).
Moreover, because IFMA instructions execute only on two ports
(presumably 0 and 1), while adds and shifts can execute on three ports
(0, 1, and 5), the adds and shifts can execute independently of the
IFMA operations, as long as there is not too much pressure on port 5.
This means that, although the total number of instructions increases,
the shifts and adds do not necessarily increase the execution time, as
long as throughput is limited by IFMA operations.
Finally, because IFMA instructions have 4 cycle latency and 0.5/1
cycle throughput (for 256/512 bit vectors), maximizing IFMA throughput
requires either 8 (for 256) or 4 (for 512) independent operations. So
accumulating groups of terms independently before adding them at the
end may be necessary anyways, in order to prevent long chains of
dependent instructions.
## Advantages of a smaller radix
Using a smaller radix has other advantages. Although radix \\(52\\)
is an unsaturated representation from the point of view of the
\\(64\\)-bit accumulators (because up to 4096 product terms can be
accumulated without carries), it's a saturated representation from the
point of view of the multiplier (since \\(52\\)-bit values are the
maximum input size).
Because the inputs to a multiplication must have all of their limbs
bounded by \\(2^{52}\\), limbs in excess of \\(2^{52}\\) must be
reduced before they can be used as an input. The
[Gueron-Krasnov][2016_gueron_krasnov] paper suggests normalizing
values using a standard, sequential carry chain: for each limb, add
the carryin from reducing the previous limb, compute the carryout and
reduce the current limb, then move to the next limb.
However, when using a smaller radix, such as \\(51\\), each limb can
store a carry bit and still be used as the input to a multiplication.
This means that the inputs do not need to be normalized, and instead
of using a sequential carry chain, we can compute all carryouts in
parallel, reduce all limbs in parallel, and then add the carryins in
parallel (possibly growing the limb values by one bit).
Because the output of this partial reduction is an acceptable
multiplication input, we can "close the loop" using partial reductions
and never have to normalize to a canonical representation through the
entire computation, in contrast to the Gueron-Krasnov approach, which
converts back to a packed representation after every operation. (This
idea seems to trace back to at least as early as [this 1999
paper][1999_walter]).
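A scalar sketch of this parallel partial reduction for radix-\\(2^{51}\\) limbs (illustrative, with an assumed name; the real code operates on vector lanes):

```rust
// Partial reduction mod p = 2^255 - 19: compute all carry-outs at
// once, mask all limbs at once, then add the carry-ins, with the top
// carry wrapping around multiplied by 19 (since 2^255 = 19 mod p).
fn partial_reduce(mut z: [u64; 5]) -> [u64; 5] {
    const LOW_51: u64 = (1 << 51) - 1;
    // All carry-outs in parallel: no sequential dependency between limbs.
    let c = [z[0] >> 51, z[1] >> 51, z[2] >> 51, z[3] >> 51, z[4] >> 51];
    for limb in z.iter_mut() {
        *limb &= LOW_51;
    }
    // Carry-ins in parallel; each limb grows by at most one bit.
    z[0] += 19 * c[4];
    z[1] += c[0];
    z[2] += c[1];
    z[3] += c[2];
    z[4] += c[3];
    z
}
```

The outputs are not canonical, but every limb fits in \\(52\\) bits, so they are acceptable multiplication inputs.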
Using \\(r = 51\\) is enough to keep a carry bit in each limb and
avoid normalizations. What about an even smaller radix? One reason
to choose a smaller radix would be to align the limb boundaries with
an inline reduction (for instance, choosing \\(r = 43\\) for the
Mersenne field \\(p = 2^{127} - 1\\)), but for \\(p = 2^{255} - 19\\),
\\(r = 51 = 255/5\\) is the natural choice.
# Multiplication
The inputs to a multiplication are two field elements
\\[
\begin{aligned}
x &= x_0 + x_1 2^{51} + x_2 2^{102} + x_3 2^{153} + x_4 2^{204} \\\\
y &= y_0 + y_1 2^{51} + y_2 2^{102} + y_3 2^{153} + y_4 2^{204},
\end{aligned}
\\]
with limbs in range \\([0,2^{52})\\).
Writing the product terms as
\\[
\begin{aligned}
z &= z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204} \\\\
&+ z_5 2^{255} + z_6 2^{306} + z_7 2^{357} + z_8 2^{408} + z_9 2^{459},
\end{aligned}
\\]
a schoolbook multiplication in product scanning form takes the form
\\[
\begin{aligned}
z_0 &= x_0 y_0 \\\\
z_1 &= x_1 y_0 + x_0 y_1 \\\\
z_2 &= x_2 y_0 + x_1 y_1 + x_0 y_2 \\\\
z_3 &= x_3 y_0 + x_2 y_1 + x_1 y_2 + x_0 y_3 \\\\
z_4 &= x_4 y_0 + x_3 y_1 + x_2 y_2 + x_1 y_3 + x_0 y_4 \\\\
z_5 &= x_4 y_1 + x_3 y_2 + x_2 y_3 + x_1 y_4 \\\\
z_6 &= x_4 y_2 + x_3 y_3 + x_2 y_4 \\\\
z_7 &= x_4 y_3 + x_3 y_4 \\\\
z_8 &= x_4 y_4 \\\\
z_9 &= 0 \\\\
\end{aligned}
\\]
Each term \\(x_i y_j\\) can be written in terms of IFMA operations as
\\[
x_i y_j = \mathrm{lo}(x_i,y_j) + 2\mathrm{hi}(x_i,y_j)2^{51}.
\\]
Substituting this equation into the schoolbook multiplication, then
moving terms to eliminate the \\(2^{51}\\) factors gives
\\[
\begin{aligned}
z_0 &= \mathrm{lo}(x_0, y_0) \\\\
&+ \qquad 0 \\\\
z_1 &= \mathrm{lo}(x_1, y_0) + \mathrm{lo}(x_0, y_1) \\\\
&+ \qquad 2( \mathrm{hi}(x_0, y_0) )\\\\
z_2 &= \mathrm{lo}(x_2, y_0) + \mathrm{lo}(x_1, y_1) + \mathrm{lo}(x_0, y_2) \\\\
&+ \qquad 2( \mathrm{hi}(x_1, y_0) + \mathrm{hi}(x_0, y_1) )\\\\
z_3 &= \mathrm{lo}(x_3, y_0) + \mathrm{lo}(x_2, y_1) + \mathrm{lo}(x_1, y_2) + \mathrm{lo}(x_0, y_3) \\\\
&+ \qquad 2( \mathrm{hi}(x_2, y_0) + \mathrm{hi}(x_1, y_1) + \mathrm{hi}(x_0, y_2) )\\\\
z_4 &= \mathrm{lo}(x_4, y_0) + \mathrm{lo}(x_3, y_1) + \mathrm{lo}(x_2, y_2) + \mathrm{lo}(x_1, y_3) + \mathrm{lo}(x_0, y_4) \\\\
&+ \qquad 2( \mathrm{hi}(x_3, y_0) + \mathrm{hi}(x_2, y_1) + \mathrm{hi}(x_1, y_2) + \mathrm{hi}(x_0, y_3) )\\\\
z_5 &= \mathrm{lo}(x_4, y_1) + \mathrm{lo}(x_3, y_2) + \mathrm{lo}(x_2, y_3) + \mathrm{lo}(x_1, y_4) \\\\
&+ \qquad 2( \mathrm{hi}(x_4, y_0) + \mathrm{hi}(x_3, y_1) + \mathrm{hi}(x_2, y_2) + \mathrm{hi}(x_1, y_3) + \mathrm{hi}(x_0, y_4) )\\\\
z_6 &= \mathrm{lo}(x_4, y_2) + \mathrm{lo}(x_3, y_3) + \mathrm{lo}(x_2, y_4) \\\\
&+ \qquad 2( \mathrm{hi}(x_4, y_1) + \mathrm{hi}(x_3, y_2) + \mathrm{hi}(x_2, y_3) + \mathrm{hi}(x_1, y_4) )\\\\
z_7 &= \mathrm{lo}(x_4, y_3) + \mathrm{lo}(x_3, y_4) \\\\
&+ \qquad 2( \mathrm{hi}(x_4, y_2) + \mathrm{hi}(x_3, y_3) + \mathrm{hi}(x_2, y_4) )\\\\
z_8 &= \mathrm{lo}(x_4, y_4) \\\\
&+ \qquad 2( \mathrm{hi}(x_4, y_3) + \mathrm{hi}(x_3, y_4) )\\\\
z_9 &= 0 \\\\
&+ \qquad 2( \mathrm{hi}(x_4, y_4) )\\\\
\end{aligned}
\\]
As noted above, our strategy will be to multiply and accumulate the
terms with coefficient \\(2\\) separately from those with coefficient
\\(1\\), before combining them at the end. This can alternately be
thought of as accumulating product terms into a *doubly-redundant*
representation, with two limbs for each digit, before collapsing
the doubly-redundant representation by shifts and adds.
This computation requires 25 `vpmadd52luq` and 25 `vpmadd52huq`
operations. For 256-bit vectors, IFMA operations execute on an
i3-8121U with latency 4 cycles, throughput 0.5 cycles, so executing 50
instructions requires 25 cycles' worth of throughput. Accumulating
terms with coefficient \\(1\\) and \\(2\\) separately means that the
longest dependency chain has length 5, so the critical path has length
20 cycles and the bottleneck is throughput.
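The doubly-redundant accumulation can be modeled in scalar code (an illustrative sketch with assumed names; the real code accumulates in IFMA lanes):

```rust
const MASK52: u64 = (1 << 52) - 1;

fn lo52(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) as u64) & MASK52
}

fn hi52(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) >> 52) as u64
}

// Terms with coefficient 1 and coefficient 2 go into separate
// accumulators, collapsed at the end by a single shift-and-add per digit.
fn mul_terms(x: [u64; 5], y: [u64; 5]) -> [u64; 10] {
    let mut ones = [0u64; 10];
    let mut twos = [0u64; 10];
    for i in 0..5 {
        for j in 0..5 {
            ones[i + j] += lo52(x[i], y[j]);
            twos[i + j + 1] += hi52(x[i], y[j]);
        }
    }
    let mut z = [0u64; 10];
    for k in 0..10 {
        z[k] = ones[k] + 2 * twos[k]; // collapse the doubly-redundant form
    }
    z
}

// Value of radix-2^51 limbs, taken mod 2^64 (limbs at index >= 2
// contribute nothing mod 2^64, since 51 * 2 > 64).
fn val_mod64(limbs: &[u64]) -> u64 {
    let mut acc = 0u64;
    for (i, &l) in limbs.iter().enumerate() {
        let sh = 51 * i;
        if sh < 64 {
            acc = acc.wrapping_add(l << sh);
        }
    }
    acc
}
```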
# Reduction modulo \\(p\\)
The next question is how to handle the reduction modulo \\(p\\).
Because \\(p = 2^{255} - 19\\), \\(2^{255} = 19 \pmod p\\), so we can
alternately write
\\[
\begin{aligned}
z &= z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204} \\\\
&+ z_5 2^{255} + z_6 2^{306} + z_7 2^{357} + z_8 2^{408} + z_9 2^{459}
\end{aligned}
\\]
as
\\[
\begin{aligned}
z &= (z_0 + 19z_5) + (z_1 + 19z_6) 2^{51} + (z_2 + 19z_7) 2^{102} + (z_3 + 19z_8) 2^{153} + (z_4 + 19z_9) 2^{204}.
\end{aligned}
\\]
When using a \\(64 \times 64 \rightarrow 128\\)-bit multiplier, this
can be handled (as in [Ed25519][ed25519_paper]) by premultiplying
source terms by \\(19\\). Since \\(\lg(19) < 4.25\\), this increases
their size by less than \\(4.25\\) bits, and the rest of the
multiplication can be shown to work out.
Here, we have at most \\(1\\) bit of headroom. In order to allow
premultiplication, we would need to use radix \\(2^{47}\\), which
would require six limbs instead of five. Instead, we compute the high
terms \\(z_5, \ldots, z_9\\), each using two chains of IFMA
operations, then multiply by \\(19\\) and combine with the lower terms
\\(z_0, \ldots, z_4\\). There are two ways to perform the
multiplication by \\(19\\): using more IFMA operations, or using the
`vpmullq` instruction, which computes the low \\(64\\) bits of a \\(64
\times 64\\)-bit product. However, `vpmullq` has 15c/1.5c
latency/throughput, in contrast to the 4c/0.5c latency/throughput of
IFMA operations, so it seems like a worse choice.
The high terms \\(z_5, \ldots, z_9\\) are sums of \\(52\\)-bit terms,
so they are larger than \\(52\\) bits. Write these terms in radix \\(52\\) as
\\[
z_{5+i} = z_{5+i}' + z_{5+i}'' 2^{52}, \qquad z_{5+i}' < 2^{52}.
\\]
Then the contribution of \\(z_{5+i}\\), taken modulo \\(p\\), is
\\[
\begin{aligned}
z_{5+i} 2^{255} 2^{51 i}
&=
19 (z_{5+i}' + z_{5+i}'' 2^{52}) 2^{51 i}
\\\\
&=
19 z_{5+i}' 2^{51 i} + 2 \cdot 19 z_{5+i}'' 2^{51 (i+1)}
\\\\
\end{aligned}
\\]
The products \\(19 z_{5+i}', 19 z_{5+i}''\\) can be written in terms of IFMA operations as
\\[
\begin{aligned}
19 z_{5+i}' &= \mathrm{lo}(19, z_{5+i}') + 2 \mathrm{hi}(19, z_{5+i}') 2^{51}, \\\\
19 z_{5+i}'' &= \mathrm{lo}(19, z_{5+i}'') + 2 \mathrm{hi}(19, z_{5+i}'') 2^{51}. \\\\
\end{aligned}
\\]
Because \\(z_{5+i} < 2^{64}\\), \\(z_{5+i}'' < 2^{12} \\), so \\(19
z_{5+i}'' < 2^{17} < 2^{52} \\) and \\(\mathrm{hi}(19, z_{5+i}'') = 0\\).
Because IFMA operations ignore the high bits of their source
operands, we do not need to compute \\(z\_{5+i}'\\) explicitly:
the high bits will be ignored.
Combining these observations, we can write
\\[
\begin{aligned}
z_{5+i} 2^{255} 2^{51 i}
&=
19 z_{5+i}' 2^{51 i} + 2 \cdot 19 z_{5+i}'' 2^{51 (i+1)}
\\\\
&=
\mathrm{lo}(19, z_{5+i}) 2^{51 i}
\+ 2 \mathrm{hi}(19, z_{5+i}) 2^{51 (i+1)}
\+ 2 \mathrm{lo}(19, z_{5+i}/2^{52}) 2^{51 (i+1)}.
\end{aligned}
\\]
For \\(i = 0,1,2,3\\), this allows reducing \\(z_{5+i}\\) onto
\\(z_{i}, z_{i+1}\\), and if the low terms are computed using a
doubly-redundant representation, no additional shifts are needed to
handle the \\(2\\) coefficients. For \\(i = 4\\), there's a
complication: the contribution becomes
\\[
\begin{aligned}
z_{9} 2^{255} 2^{204}
&=
\mathrm{lo}(19, z_{9}) 2^{204}
\+ 2 \mathrm{hi}(19, z_{9}) 2^{255}
\+ 2 \mathrm{lo}(19, z_{9}/2^{52}) 2^{255}
\\\\
&=
\mathrm{lo}(19, z_{9}) 2^{204}
\+ 2 \mathrm{hi}(19, z_{9}) 19
\+ 2 \mathrm{lo}(19, z_{9}/2^{52}) 19
\\\\
&=
\mathrm{lo}(19, z_{9}) 2^{204}
\+ 2
\mathrm{lo}(19, \mathrm{hi}(19, z_{9}) + \mathrm{lo}(19, z_{9}/2^{52})).
\\\\
\end{aligned}
\\]
It would be possible to cut the number of multiplications from 3 to 2
by carrying the high part of each \\(z_i\\) onto \\(z_{i+1}\\). This
would eliminate 5 multiplications, clearing 2.5 cycles of port
pressure, at the cost of 5 additions, adding 1.66 cycles of port
pressure. But doing this would create a dependency between terms
(e.g., \\(z_{5}\\) must be computed before the reduction of
\\(z_{6}\\) can begin), whereas with the approach above, all
contributions to all terms are computed independently, to maximize ILP
and flexibility for the processor to schedule instructions.
This strategy performs 16 IFMA operations, adding two IFMA operations
to each of the \\(2\\)-coefficient terms and one to each of the
\\(1\\)-coefficient terms. Considering the multiplication and
reduction together, we use 66 IFMA operations, requiring 33 cycles'
throughput, while the longest chain of IFMA operations is in the
reduction of \\(z_5\\) onto \\(z_1\\), of length 7 (28 cycles, plus
2 cycles to combine the two parts of \\(z_5\\)), so the bottleneck is
again throughput.
Once this is done, we have computed the product terms
\\[
z = z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204},
\\]
without reducing the \\(z_i\\) to fit in \\(52\\) bits. Because the
overall flow of operations alternates multiplications and additions or
subtractions, we would have to perform a reduction after an addition
but before the next multiplication anyways, so there's no benefit to
fully reducing the limbs at the end of a multiplication. Instead, we
leave them unreduced, and track the reduction state using the type
system to ensure that unreduced limbs are not accidentally used as an
input to a multiplication.
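A hypothetical sketch of this type-level tracking (names invented for illustration; the actual types differ):

```rust
use std::marker::PhantomData;

struct Reduced;
struct Unreduced;

// The reduction state is a type parameter, so passing an unreduced
// element where a reduced one is required is a compile error, not a
// silent overflow.
struct FieldElement<S> {
    limbs: [u64; 5],
    _state: PhantomData<S>,
}

impl FieldElement<Unreduced> {
    /// Parallel partial reduction; output limbs fit in 52 bits.
    fn reduce(self) -> FieldElement<Reduced> {
        // ... carry computation elided ...
        FieldElement { limbs: self.limbs, _state: PhantomData }
    }
}

impl FieldElement<Reduced> {
    /// Multiplication takes reduced inputs, returns an unreduced output.
    fn mul(&self, other: &FieldElement<Reduced>) -> FieldElement<Unreduced> {
        // ... IFMA product elided ...
        let _ = other;
        FieldElement { limbs: self.limbs, _state: PhantomData }
    }
}
```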
# Squaring
Squaring operates similarly to multiplication, but with the
possibility to combine identical terms.
As before, we write the input as
\\[
\begin{aligned}
x &= x_0 + x_1 2^{51} + x_2 2^{102} + x_3 2^{153} + x_4 2^{204}
\end{aligned}
\\]
with limbs in range \\([0,2^{52})\\).
Writing the product terms as
\\[
\begin{aligned}
z &= z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204} \\\\
&+ z_5 2^{255} + z_6 2^{306} + z_7 2^{357} + z_8 2^{408} + z_9 2^{459},
\end{aligned}
\\]
a schoolbook squaring in product scanning form takes the form
\\[
\begin{aligned}
z_0 &= x_0 x_0 \\\\
z_1 &= 2 x_1 x_0 \\\\
z_2 &= 2 x_2 x_0 + x_1 x_1 \\\\
z_3 &= 2 x_3 x_0 + 2 x_2 x_1 \\\\
z_4 &= 2 x_4 x_0 + 2 x_3 x_1 + x_2 x_2 \\\\
z_5 &= 2 x_4 x_1 + 2 x_3 x_2 \\\\
z_6 &= 2 x_4 x_2 + x_3 x_3 \\\\
z_7 &= 2 x_4 x_3 \\\\
z_8 &= x_4 x_4 \\\\
z_9 &= 0 \\\\
\end{aligned}
\\]
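The term counts above can be checked against the general schoolbook product with \\(y = x\\) (scalar illustration with small limbs, so no lo/hi splitting is needed):

```rust
// The squaring formulas above, written out directly: combined terms
// x_i x_j with i != j appear once with coefficient 2.
fn square_terms(x: [u64; 5]) -> [u64; 10] {
    let mut z = [0u64; 10];
    z[0] = x[0] * x[0];
    z[1] = 2 * x[1] * x[0];
    z[2] = 2 * x[2] * x[0] + x[1] * x[1];
    z[3] = 2 * x[3] * x[0] + 2 * x[2] * x[1];
    z[4] = 2 * x[4] * x[0] + 2 * x[3] * x[1] + x[2] * x[2];
    z[5] = 2 * x[4] * x[1] + 2 * x[3] * x[2];
    z[6] = 2 * x[4] * x[2] + x[3] * x[3];
    z[7] = 2 * x[4] * x[3];
    z[8] = x[4] * x[4];
    z[9] = 0;
    z
}
```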
As before, we write \\(x_i x_j\\) as
\\[
x_i x_j = \mathrm{lo}(x_i,x_j) + 2\mathrm{hi}(x_i,x_j)2^{51},
\\]
and substitute to obtain
\\[
\begin{aligned}
z_0 &= \mathrm{lo}(x_0, x_0) + 0 \\\\
z_1 &= 2 \mathrm{lo}(x_1, x_0) + 2 \mathrm{hi}(x_0, x_0) \\\\
z_2 &= 2 \mathrm{lo}(x_2, x_0) + \mathrm{lo}(x_1, x_1) + 4 \mathrm{hi}(x_1, x_0) \\\\
z_3 &= 2 \mathrm{lo}(x_3, x_0) + 2 \mathrm{lo}(x_2, x_1) + 4 \mathrm{hi}(x_2, x_0) + 2 \mathrm{hi}(x_1, x_1) \\\\
z_4 &= 2 \mathrm{lo}(x_4, x_0) + 2 \mathrm{lo}(x_3, x_1) + \mathrm{lo}(x_2, x_2) + 4 \mathrm{hi}(x_3, x_0) + 4 \mathrm{hi}(x_2, x_1) \\\\
z_5 &= 2 \mathrm{lo}(x_4, x_1) + 2 \mathrm{lo}(x_3, x_2) + 4 \mathrm{hi}(x_4, x_0) + 4 \mathrm{hi}(x_3, x_1) + 2 \mathrm{hi}(x_2, x_2) \\\\
z_6 &= 2 \mathrm{lo}(x_4, x_2) + \mathrm{lo}(x_3, x_3) + 4 \mathrm{hi}(x_4, x_1) + 4 \mathrm{hi}(x_3, x_2) \\\\
z_7 &= 2 \mathrm{lo}(x_4, x_3) + 4 \mathrm{hi}(x_4, x_2) + 2 \mathrm{hi}(x_3, x_3) \\\\
z_8 &= \mathrm{lo}(x_4, x_4) + 4 \mathrm{hi}(x_4, x_3) \\\\
z_9 &= 0 + 2 \mathrm{hi}(x_4, x_4) \\\\
\end{aligned}
\\]
To implement these, we group terms by their coefficient, computing
those with coefficient \\(2\\) on one set of IFMA chains, while on
another set of chains we begin with the coefficient-\\(4\\) terms,
then shift left before continuing with the coefficient-\\(1\\) terms.
The reduction strategy is the same as for multiplication.
# Future improvements
LLVM won't use blend operations on [256-bit vectors yet][llvm_blend],
so there are a number of blend instructions that could be omitted.
Although the multiplications and squarings are much faster, there's no
speedup to the additions and subtractions, so there are diminishing
returns. In fact, the complications in the doubling formulas mean
that doubling is actually slower than readdition. This also suggests
that moving to 512-bit vectors won't be much help for a strategy aimed
at parallelism within a group operation, so to extract performance
gains from 512-bit vectors it will probably be necessary to create a
parallel-friendly multiscalar multiplication algorithm. This could
also help with reducing shuffle pressure.
The squaring implementation could probably be optimized, but without
`perf` support on Cannonlake it's difficult to make actual
measurements.
Another improvement would be to implement vectorized square root
computations, which would allow creating an iterator adaptor for point
decompression that bunched decompression operations and executed them
in parallel. This would accelerate batch verification.
[2016_gueron_krasnov]: https://ieeexplore.ieee.org/document/7563269
[2018_drucker_gueron]: https://eprint.iacr.org/2018/335
[1999_walter]: https://pdfs.semanticscholar.org/0e6a/3e8f30b63b556679f5dff2cbfdfe9523f4fa.pdf
[ed25519_paper]: https://ed25519.cr.yp.to/ed25519-20110926.pdf
[llvm_blend]: https://bugs.llvm.org/show_bug.cgi?id=38343


@@ -0,0 +1,333 @@
Vectorized implementations of field and point operations, using a
modification of the 4-way parallel formulas of Hisil, Wong, Carter,
and Dawson.
These notes explain the parallel formulas and our strategy for using
them with SIMD operations. There are two backend implementations: one
using AVX2, and the other using AVX512-IFMA.
# Overview
The 2008 paper [_Twisted Edwards Curves Revisited_][hwcd08] by Hisil,
Wong, Carter, and Dawson (HWCD) introduced the “extended coordinates”
and mixed-model representations which are used by most Edwards curve
implementations.
However, they also describe 4-way parallel formulas for point addition
and doubling: a unified addition algorithm taking an effective
\\(2\mathbf M + 1\mathbf D\\), a doubling algorithm taking an
effective \\(1\mathbf M + 1\mathbf S\\), and a dedicated (i.e., for
distinct points) addition algorithm taking an effective \\(2 \mathbf M
\\). They compare these formulas with a 2-way parallel variant of the
Montgomery ladder.
Unlike their serial formulas, which are used widely, their parallel
formulas do not seem to have been implemented in software before. The
2-way parallel Montgomery ladder was used in 2015 by Tung Chou's
`sandy2x` implementation. Curiously, however, although the [`sandy2x`
paper][sandy2x] also implements Edwards arithmetic, and cites HWCD08,
it doesn't mention their parallel Edwards formulas.
A 2015 paper by Hernández and López describes an AVX2 implementation
of X25519. Neither the paper nor the code is publicly available, but
it apparently gives only a [slight speedup][avx2trac], suggesting that
it uses a 4-way parallel Montgomery ladder rather than parallel
Edwards formulas.
The reason may be that HWCD08 describe their formulas as operating on
four independent processors, which would make a software
implementation impractical: all of the operations are too low-latency
to effectively synchronize. But a closer inspection reveals that the
(more expensive) multiplication and squaring steps are uniform, while
the instruction divergence occurs in the (much cheaper) addition and
subtraction steps. This means that a SIMD implementation can perform
the expensive steps uniformly, and handle divergence in the
inexpensive steps using masking.
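As a sketch of this idea (a portable model, not the actual backend code), a divergent add/subtract step can be handled by computing both results for every lane and then selecting per lane with a mask, the way a `vpblendd`-style blend would:

```rust
/// Portable model of masked divergence: every lane pays for both the sum
/// and the difference, then a per-lane mask selects which one is kept,
/// mimicking a vector blend instruction.
fn blend_add_sub(a: [u32; 4], b: [u32; 4], subtract: [bool; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        let sum = a[i].wrapping_add(b[i]);
        let diff = a[i].wrapping_sub(b[i]);
        // In vector code this is a data-dependent select, not a branch.
        out[i] = if subtract[i] { diff } else { sum };
    }
    out
}
```

The expensive multiplications stay uniform across all four lanes; only cheap selects like this differ between the four "processors".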
These notes describe modifications to the original parallel formulas
to allow a SIMD implementation, and this module contains
implementations of the modified formulas targeting either AVX2 or
AVX512-IFMA.
# Parallel formulas in HWCD'08
The doubling formula is presented in the HWCD paper as follows:
| Cost | Processor 1 | Processor 2 | Processor 3 | Processor 4 |
|------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------------|
| | idle | idle | idle | \\( R\_1 \gets X\_1 + Y\_1 \\) |
| \\(1\mathbf S\\) | \\( R\_2 \gets X\_1\^2 \\) | \\( R\_3 \gets Y\_1\^2 \\) | \\( R\_4 \gets Z\_1\^2 \\) | \\( R\_5 \gets R\_1\^2 \\) |
| | \\( R\_6 \gets R\_2 + R\_3 \\) | \\( R\_7 \gets R\_2 - R\_3 \\) | \\( R\_4 \gets 2 R\_4 \\) | idle |
| | idle | \\( R\_1 \gets R\_4 + R\_7 \\) | idle | \\( R\_2 \gets R\_6 - R\_5 \\) |
| \\(1\mathbf M\\) | \\( X\_3 \gets R\_1 R\_2 \\) | \\( Y\_3 \gets R\_6 R\_7 \\) | \\( T\_3 \gets R\_2 R\_6 \\) | \\( Z\_3 \gets R\_1 R\_7 \\) |
and the unified addition algorithm is presented as follows:
| Cost | Processor 1 | Processor 2 | Processor 3 | Processor 4 |
|------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------------|
| | \\( R\_1 \gets Y\_1 - X\_1 \\) | \\( R\_2 \gets Y\_2 - X\_2 \\) | \\( R\_3 \gets Y\_1 + X\_1 \\) | \\( R\_4 \gets Y\_2 + X\_2 \\) |
| \\(1\mathbf M\\) | \\( R\_5 \gets R\_1 R\_2 \\) | \\( R\_6 \gets R\_3 R\_4 \\) | \\( R\_7 \gets T\_1 T\_2 \\) | \\( R\_8 \gets Z\_1 Z\_2 \\) |
| \\(1\mathbf D\\) | idle | idle | \\( R\_7 \gets k R\_7 \\) | \\( R\_8 \gets 2 R\_8 \\) |
| | \\( R\_1 \gets R\_6 - R\_5 \\) | \\( R\_2 \gets R\_8 - R\_7 \\) | \\( R\_3 \gets R\_8 + R\_7 \\) | \\( R\_4 \gets R\_6 + R\_5 \\) |
| \\(1\mathbf M\\) | \\( X\_3 \gets R\_1 R\_2 \\) | \\( Y\_3 \gets R\_3 R\_4 \\) | \\( T\_3 \gets R\_1 R\_4 \\) | \\( Z\_3 \gets R\_2 R\_3 \\) |
Here \\(\mathbf M\\) and \\(\mathbf S\\) represent the cost of
multiplication and squaring of generic field elements, \\(\mathbf D\\)
represents the cost of multiplication by a curve constant (in this
case \\( k = 2d \\)).
Notice that the \\(1\mathbf M\\) and \\(1\mathbf S\\) steps are
uniform. The non-uniform steps are all inexpensive additions or
subtractions, with the exception of the multiplication by the curve
constant \\(k = 2d\\):
$$
R\_7 \gets 2 d R\_7.
$$
HWCD suggest parallelising this step by breaking \\(k = 2d\\) into four
parts as \\(k = k_0 + 2\^n k_1 + 2\^{2n} k_2 + 2\^{3n} k_3 \\) and
computing \\(k_i R_7 \\) in parallel. This is quite awkward, but if
the curve constant is a ratio \\( d = d\_1/d\_2 \\), then projective
coordinates allow us to instead compute
$$
(R\_5, R\_6, R\_7, R\_8) \gets (d\_2 R\_5, d\_2 R\_6, 2d\_1 R\_7, 2d\_2 R\_8).
$$
This can be performed as a uniform multiplication by a vector of
constants, and if \\(d\_1, d\_2\\) are small, it is relatively
inexpensive. (This trick was suggested by Mike Hamburg).
In the Curve25519 case, we have
$$
d = \frac{d\_1}{d\_2} = \frac{-121665}{121666};
$$
Since \\(2 \cdot 121666 < 2\^{18}\\), all the constants above fit (up
to sign) in 32 bits, so this can be done in parallel as four
multiplications by small constants \\( (121666, 121666, 2\cdot 121666,
2\cdot 121665) \\), followed by a negation to compute \\( - 2\cdot 121665\\).
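The bound on the constants is easy to check; a small sanity sketch (the constant name here is ours, not the library's):

```rust
// Lane constants for the multiply-by-curve-constants step, in the
// (d2, d2, 2*d2, 2*d1) order used for the addition formulas; the sign of
// d1 = -121665 is restored by a separate negation.
const LANE_CONSTANTS: [u32; 4] = [121666, 121666, 2 * 121666, 2 * 121665];

/// True iff every lane constant fits in 18 bits, so a product with an
/// unreduced ~26-bit limb stays well inside a 32x32 -> 64-bit multiply.
fn constants_fit_in_18_bits() -> bool {
    LANE_CONSTANTS.iter().all(|&c| c < 1 << 18)
}
```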
# Modified parallel formulas
Using the modifications sketched above, we can write SIMD-friendly
versions of the parallel formulas as follows. To avoid confusion with
the original formulas, temporary variables are named \\(S\\) instead
of \\(R\\) and are in static single-assignment form.
## Addition
To add points
\\(P_1 = (X_1 : Y_1 : Z_1 : T_1) \\)
and
\\(P_2 = (X_2 : Y_2 : Z_2 : T_2 ) \\),
we compute
$$
\begin{aligned}
(S\_0 &&,&& S\_1 &&,&& S\_2 &&,&& S\_3 )
&\gets
(Y\_1 - X\_1&&,&& Y\_1 + X\_1&&,&& Y\_2 - X\_2&&,&& Y\_2 + X\_2)
\\\\
(S\_4 &&,&& S\_5 &&,&& S\_6 &&,&& S\_7 )
&\gets
(S\_0 \cdot S\_2&&,&& S\_1 \cdot S\_3&&,&& Z\_1 \cdot Z\_2&&,&& T\_1 \cdot T\_2)
\\\\
(S\_8 &&,&& S\_9 &&,&& S\_{10} &&,&& S\_{11} )
&\gets
(d\_2 \cdot S\_4 &&,&& d\_2 \cdot S\_5 &&,&& 2 d\_2 \cdot S\_6 &&,&& 2 d\_1 \cdot S\_7 )
\\\\
(S\_{12} &&,&& S\_{13} &&,&& S\_{14} &&,&& S\_{15})
&\gets
(S\_9 - S\_8&&,&& S\_9 + S\_8&&,&& S\_{10} - S\_{11}&&,&& S\_{10} + S\_{11})
\\\\
(X\_3&&,&& Y\_3&&,&& Z\_3&&,&& T\_3)
&\gets
(S\_{12} \cdot S\_{14}&&,&& S\_{15} \cdot S\_{13}&&,&& S\_{15} \cdot S\_{14}&&,&& S\_{12} \cdot S\_{13})
\end{aligned}
$$
to obtain \\( P\_3 = (X\_3 : Y\_3 : Z\_3 : T\_3) = P\_1 + P\_2 \\).
This costs \\( 2\mathbf M + 1 \mathbf D\\).
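As a concrete illustration of the data flow (over a toy prime field, not the real radix-\\(2\^{25.5}\\) arithmetic), the five vector steps can be sketched as follows; `P`, `D1`, and `D2` are hypothetical stand-ins chosen only to exercise the formulas:

```rust
// Toy-field sketch of the five vector steps: P is a stand-in prime (not
// 2^255 - 19) and D1/D2 are hypothetical curve-constant numerator and
// denominator, not the real Curve25519 values.
const P: u64 = 101;
const D1: u64 = 7;
const D2: u64 = 9;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// (X1:Y1:Z1:T1) + (X2:Y2:Z2:T2) via the modified parallel formulas.
fn parallel_add(p1: [u64; 4], p2: [u64; 4]) -> [u64; 4] {
    let [x1, y1, z1, t1] = p1;
    let [x2, y2, z2, t2] = p2;
    // (S0, S1, S2, S3): the divergent but cheap +/- step
    let (s0, s1, s2, s3) = (sub(y1, x1), add(y1, x1), sub(y2, x2), add(y2, x2));
    // (S4, S5, S6, S7): the uniform 1M step
    let (s4, s5, s6, s7) = (mul(s0, s2), mul(s1, s3), mul(z1, z2), mul(t1, t2));
    // (S8, S9, S10, S11): the 1D step, a uniform multiply by small constants
    let (s8, s9) = (mul(D2, s4), mul(D2, s5));
    let (s10, s11) = (mul(mul(2, D2), s6), mul(mul(2, D1), s7));
    // (S12, S13, S14, S15): another divergent +/- step
    let (s12, s13) = (sub(s9, s8), add(s9, s8));
    let (s14, s15) = (sub(s10, s11), add(s10, s11));
    // (X3, Y3, Z3, T3): the final uniform 1M step
    [mul(s12, s14), mul(s15, s13), mul(s15, s14), mul(s12, s13)]
}
```

Each output coordinate is the original HWCD result scaled by \\(2 d\_2\^2 / 2 = d\_2\^2\\), so the projective point is unchanged.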
## Readdition
If the point \\( P\_2 = (X\_2 : Y\_2 : Z\_2 : T\_2) \\) is fixed, we
can cache the multiplication of the curve constants by computing
$$
\begin{aligned}
(S\_2\' &&,&& S\_3\' &&,&& Z\_2\' &&,&& T\_2\' )
&\gets
(d\_2 \cdot (Y\_2 - X\_2)&&,&& d\_2 \cdot (Y\_2 + X\_2)&&,&& 2d\_2 \cdot Z\_2 &&,&& 2d\_1 \cdot T\_2).
\end{aligned}
$$
This costs \\( 1\mathbf D\\); with \\( (S\_2\', S\_3\', Z\_2\', T\_2\')\\)
in hand, the addition formulas above become
$$
\begin{aligned}
(S\_0 &&,&& S\_1 &&,&& Z\_1 &&,&& T\_1 )
&\gets
(Y\_1 - X\_1&&,&& Y\_1 + X\_1&&,&& Z\_1 &&,&& T\_1)
\\\\
(S\_8 &&,&& S\_9 &&,&& S\_{10} &&,&& S\_{11} )
&\gets
(S\_0 \cdot S\_2\' &&,&& S\_1 \cdot S\_3\'&&,&& Z\_1 \cdot Z\_2\' &&,&& T\_1 \cdot T\_2\')
\\\\
(S\_{12} &&,&& S\_{13} &&,&& S\_{14} &&,&& S\_{15})
&\gets
(S\_9 - S\_8&&,&& S\_9 + S\_8&&,&& S\_{10} - S\_{11}&&,&& S\_{10} + S\_{11})
\\\\
(X\_3&&,&& Y\_3&&,&& Z\_3&&,&& T\_3)
&\gets
(S\_{12} \cdot S\_{14}&&,&& S\_{15} \cdot S\_{13}&&,&& S\_{15} \cdot S\_{14}&&,&& S\_{12} \cdot S\_{13})
\end{aligned}
$$
which costs only \\( 2\mathbf M \\). This precomputation is
essentially the same as the precomputation HWCD suggest for their
serial formulas. Because precomputation followed by readdition costs
the same as a single addition, it's sufficient to implement only
caching and readdition.
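Continuing the toy-field sketch (again with hypothetical stand-ins `P`, `D1`, `D2`, not the real constants), caching and readdition look like this:

```rust
// Toy-field sketch of caching and readdition; P, D1, D2 are hypothetical
// stand-ins, not the real curve constants.
const P: u64 = 101;
const D1: u64 = 7;
const D2: u64 = 9;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Fold the curve constants into a fixed point: one 1D step, paid once.
fn cache(p2: [u64; 4]) -> [u64; 4] {
    let [x2, y2, z2, t2] = p2;
    [
        mul(D2, sub(y2, x2)),  // S2'
        mul(D2, add(y2, x2)),  // S3'
        mul(mul(2, D2), z2),   // Z2'
        mul(mul(2, D1), t2),   // T2'
    ]
}

/// Readdition against a cached point: only the two uniform 1M steps
/// and the cheap +/- steps remain.
fn readd(p1: [u64; 4], cached: [u64; 4]) -> [u64; 4] {
    let [x1, y1, z1, t1] = p1;
    let [s2c, s3c, z2c, t2c] = cached;
    let (s0, s1) = (sub(y1, x1), add(y1, x1));
    let (s8, s9, s10, s11) = (mul(s0, s2c), mul(s1, s3c), mul(z1, z2c), mul(t1, t2c));
    let (s12, s13) = (sub(s9, s8), add(s9, s8));
    let (s14, s15) = (sub(s10, s11), add(s10, s11));
    [mul(s12, s14), mul(s15, s13), mul(s15, s14), mul(s12, s13)]
}
```

Readding a cached point produces exactly the same coordinates as the uncached addition, since the 1D step commutes with the following multiplication.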
## Doubling
The non-uniform portions of the (re)addition formulas have a fairly
regular structure. Unfortunately, this is not the case for the
doubling formulas, which are much less nice.
To double a point \\( P\_1 = (X\_1 : Y\_1 : Z\_1 : T\_1) \\), we compute
$$
\begin{aligned}
(X\_1 &&,&& Y\_1 &&,&& Z\_1 &&,&& S\_0)
&\gets
(X\_1 &&,&& Y\_1 &&,&& Z\_1 &&,&& X\_1 + Y\_1)
\\\\
(S\_1 &&,&& S\_2 &&,&& S\_3 &&,&& S\_4 )
&\gets
(X\_1\^2 &&,&& Y\_1\^2&&,&& Z\_1\^2 &&,&& S\_0\^2)
\\\\
(S\_5 &&,&& S\_6 &&,&& S\_8 &&,&& S\_9 )
&\gets
(S\_1 + S\_2 &&,&& S\_1 - S\_2 &&,&& S\_1 + 2S\_3 - S\_2 &&,&& S\_1 + S\_2 - S\_4)
\\\\
(X\_3 &&,&& Y\_3 &&,&& Z\_3 &&,&& T\_3 )
&\gets
(S\_8 \cdot S\_9 &&,&& S\_5 \cdot S\_6 &&,&& S\_8 \cdot S\_6 &&,&& S\_5 \cdot S\_9)
\end{aligned}
$$
to obtain \\( P\_3 = (X\_3 : Y\_3 : Z\_3 : T\_3) = \[2\]P\_1 \\).
The intermediate step between the squaring and multiplication requires
a long chain of additions. For the IFMA-based implementation, this is
not a problem; for the AVX2-based implementation, it is, but with some
care and finesse, it's possible to arrange the computation without
requiring an intermediate reduction.
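The doubling steps can be sketched over the same toy field (`P` is a hypothetical stand-in prime; no curve constants are needed for doubling):

```rust
// Toy-field sketch of the parallel doubling steps; P is a stand-in prime,
// not 2^255 - 19.
const P: u64 = 101;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// [2](X1:Y1:Z1:T1) via the modified parallel doubling formulas.
fn parallel_double(p: [u64; 4]) -> [u64; 4] {
    let [x1, y1, z1, _t1] = p; // T1 is not used by doubling
    let s0 = add(x1, y1);
    // The uniform 1S step
    let (s1, s2, s3, s4) = (mul(x1, x1), mul(y1, y1), mul(z1, z1), mul(s0, s0));
    // The irregular chain of additions and subtractions
    let s5 = add(s1, s2);
    let s6 = sub(s1, s2);
    let s8 = sub(add(s1, add(s3, s3)), s2); // S1 + 2*S3 - S2
    let s9 = sub(add(s1, s2), s4);          // S1 + S2 - S4
    // The uniform 1M step
    [mul(s8, s9), mul(s5, s6), mul(s8, s6), mul(s5, s9)]
}
```

The outputs coincide exactly (no projective rescaling this time) with HWCD's original four-processor schedule from the doubling table above.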
# Implementation
These formulas aren't specific to a particular representation of field
element vectors, whose optimum choice is determined by the details of
the instruction set. However, it's not possible to perfectly separate
the implementation of the field element vectors from the
implementation of the point operations. Instead, the [`avx2`] and
[`ifma`] backends provide `ExtendedPoint` and `CachedPoint` types, and
the [`scalar_mul`] code selects one of the backend types via a type alias.
# Comparison to non-vectorized formulas
In theory, the parallel Edwards formulas seem to allow a \\(4\\)-way
speedup from parallelism. However, an actual vectorized
implementation has several slowdowns that cut into this speedup.
First, the parallel formulas can only use the available vector
multiplier. For AVX2, this is a \\( 32 \times 32 \rightarrow 64
\\)-bit integer multiplier, so the speedup from vectorization must
overcome the disadvantage of losing the \\( 64 \times 64 \rightarrow
128\\)-bit (serial) integer multiplier. The effect of this slowdown
is microarchitecture-dependent, since it requires accounting for the
total number of multiplications and additions and their relative
costs. IFMA allows using a \\( 52 \times 52 \rightarrow 104 \\)-bit
multiplier, but the high and low halves need to be computed
separately, and the reduction requires extra work because it's not
possible to pre-multiply by \\(19\\).
Second, the parallel doubling formulas incur both a theoretical and
practical slowdown. The parallel formulas described above work on the
\\( \mathbb P\^3 \\) “extended” coordinates. The \\( \mathbb P\^2 \\)
model introduced earlier by [Bernstein, Birkner, Joye, Lange, and
Peters][bbjlp08] allows slightly faster doublings, so HWCD suggest
mixing coordinate systems while performing scalar multiplication
(attributing the idea to [a 1998 paper][cmo98] by Cohen, Miyaji, and
Ono). The \\( T \\) coordinate is not required for doublings, so when
doublings are followed by doublings, its computation can be skipped.
More details on this approach and the different coordinate systems can
be found in the [`curve_models` module documentation][curve_models].
Unfortunately, this optimization is not compatible with the parallel
formulas, which cannot save time by skipping a single variable, so the
parallel doubling formulas do slightly more work when counting the
total number of field multiplications and squarings.
In addition, the parallel doubling formulas have a less regular
pattern of additions and subtractions than the parallel addition
formulas, so the vectorization overhead is proportionately greater.
Both the parallel addition and parallel doubling formulas also require
some shuffling to rearrange data within the vectors, which places more
pressure on the shuffle unit than is desirable.
This means that the speedup from using a vectorized implementation of
parallel Edwards formulas is likely to be greatest in applications
that do fewer doublings and more additions (like a large multiscalar
multiplication) rather than applications that do fewer additions and
more doublings (like a double-base scalar multiplication).
Third, Amdahl's law says that the speedup is limited to the portion
which can be parallelized. Normally, the field multiplications
dominate the cost of point operations, but with the IFMA backend, the
multiplications are so fast that the non-parallel additions end up as
a significant portion of the total time.
Fourth, current Intel CPUs perform thermal throttling when using wide
vector instructions. A detailed description can be found in §15.26 of
[the Intel Optimization Manual][intel], but using wide vector
instructions prevents the core from operating at higher frequencies.
The core can return to the higher-frequency state after 2
milliseconds, but this timer is reset every time high-power
instructions are used.
Any speedup from vectorization therefore has to be weighed against a
slowdown for the next few million instructions. For a mixed workload,
where point operations are interspersed with other tasks, this can
reduce overall performance. This implementation is therefore probably
not suitable for basic applications, like signatures, but is
worthwhile for complex applications, like zero-knowledge proofs, which
do sustained work.
# Future work
There are several directions for future improvement:
* Using the vectorized field arithmetic code to parallelize across
point operations rather than within a single point operation. This
is less flexible, but would give a speedup both from allowing use of
the faster mixed-model arithmetic and from reducing shuffle
pressure. One approach in this direction would be to implement
batched scalar-point operations using vectors of points (AoSoA
layout). This is less generally useful, but would give a speedup for
Bulletproofs.
* Extending the IFMA implementation to use the full width of AVX512,
either handling the extra parallelism internally to a single point
operation (by using a 2-way parallel implementation of field
arithmetic instead of a wordsliced one), or externally,
parallelizing across point operations. Internal parallelism would
be preferable but might require too much shuffle pressure. For now,
the only available CPU which runs IFMA operations executes them at
256 bits wide anyway, so this isn't yet important.
* Generalizing the implementation to NEON instructions. The current
point arithmetic code is written in terms of field element vectors,
which are in turn implemented using platform SIMD vectors. It
should be possible to write an alternate implementation of the
`FieldElement2625x4` using NEON without changing the point
arithmetic. NEON has 128-bit vectors rather than 256-bit vectors,
but this may still be worthwhile compared to a serial
implementation.
[sandy2x]: https://eprint.iacr.org/2015/943.pdf
[avx2trac]: https://trac.torproject.org/projects/tor/ticket/8897#comment:28
[hwcd08]: https://www.iacr.org/archive/asiacrypt2008/53500329/53500329.pdf
[curve_models]: https://docs.rs/curve25519-dalek/latest/curve25519-dalek/backend/serial/curve_models/index.html
[bbjlp08]: https://eprint.iacr.org/2008/013
[cmo98]: https://link.springer.com/content/pdf/10.1007%2F3-540-49649-1_6.pdf
[intel]: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf