334 lines
16 KiB
Markdown
334 lines
16 KiB
Markdown
|
|
Vectorized implementations of field and point operations, using a
|
||
|
|
modification of the 4-way parallel formulas of Hisil, Wong, Carter,
|
||
|
|
and Dawson.
|
||
|
|
|
||
|
|
These notes explain the parallel formulas and our strategy for using
|
||
|
|
them with SIMD operations. There are two backend implementations: one
|
||
|
|
using AVX2, and the other using AVX512-IFMA.
|
||
|
|
|
||
|
|
# Overview
|
||
|
|
|
||
|
|
The 2008 paper [_Twisted Edwards Curves Revisited_][hwcd08] by Hisil,
|
||
|
|
Wong, Carter, and Dawson (HWCD) introduced the “extended coordinates”
|
||
|
|
and mixed-model representations which are used by most Edwards curve
|
||
|
|
implementations.
|
||
|
|
|
||
|
|
However, they also describe 4-way parallel formulas for point addition
|
||
|
|
and doubling: a unified addition algorithm taking an effective
|
||
|
|
\\(2\mathbf M + 1\mathbf D\\), a doubling algorithm taking an
|
||
|
|
effective \\(1\mathbf M + 1\mathbf S\\), and a dedicated (i.e., for
|
||
|
|
distinct points) addition algorithm taking an effective \\(2 \mathbf M
|
||
|
|
\\). They compare these formulas with a 2-way parallel variant of the
|
||
|
|
Montgomery ladder.
|
||
|
|
|
||
|
|
Unlike their serial formulas, which are used widely, their parallel
|
||
|
|
formulas do not seem to have been implemented in software before. The
|
||
|
|
2-way parallel Montgomery ladder was used in 2015 by Tung Chou's
|
||
|
|
`sandy2x` implementation. Curiously, however, although the [`sandy2x`
|
||
|
|
paper][sandy2x] also implements Edwards arithmetic, and cites HWCD08,
|
||
|
|
it doesn't mention their parallel Edwards formulas.
|
||
|
|
A 2015 paper by Hernández and López describes an AVX2 implementation
|
||
|
|
of X25519. Neither the paper nor the code are publicly available, but
|
||
|
|
it apparently gives only a [slight speedup][avx2trac], suggesting that
|
||
|
|
it uses a 4-way parallel Montgomery ladder rather than parallel
|
||
|
|
Edwards formulas.
|
||
|
|
|
||
|
|
The reason may be that HWCD08 describe their formulas as operating on
|
||
|
|
four independent processors, which would make a software
|
||
|
|
implementation impractical: all of the operations are too low-latency
|
||
|
|
to effectively synchronize. But a closer inspection reveals that the
|
||
|
|
(more expensive) multiplication and squaring steps are uniform, while
|
||
|
|
the instruction divergence occurs in the (much cheaper) addition and
|
||
|
|
subtraction steps. This means that a SIMD implementation can perform
|
||
|
|
the expensive steps uniformly, and handle divergence in the
|
||
|
|
inexpensive steps using masking.
|
||
|
|
|
||
|
|
These notes describe modifications to the original parallel formulas
|
||
|
|
to allow a SIMD implementation, and this module contains
|
||
|
|
implementations of the modified formulas targeting either AVX2 or
|
||
|
|
AVX512-IFMA.
|
||
|
|
|
||
|
|
# Parallel formulas in HWCD'08
|
||
|
|
|
||
|
|
The doubling formula is presented in the HWCD paper as follows:
|
||
|
|
|
||
|
|
| Cost | Processor 1 | Processor 2 | Processor 3 | Processor 4 |
|
||
|
|
|------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------------|
|
||
|
|
| | idle | idle | idle | \\( R\_1 \gets X\_1 + Y\_1 \\) |
|
||
|
|
| \\(1\mathbf S\\) | \\( R\_2 \gets X\_1\^2 \\) | \\( R\_3 \gets Y\_1\^2 \\) | \\( R\_4 \gets Z\_1\^2 \\) | \\( R\_5 \gets R\_1\^2 \\) |
|
||
|
|
| | \\( R\_6 \gets R\_2 + R\_3 \\) | \\( R\_7 \gets R\_2 - R\_3 \\) | \\( R\_4 \gets 2 R\_4 \\) | idle |
|
||
|
|
| | idle | \\( R\_1 \gets R\_4 + R\_7 \\) | idle | \\( R\_2 \gets R\_6 - R\_5 \\) |
|
||
|
|
| \\(1\mathbf M\\) | \\( X\_3 \gets R\_1 R\_2 \\) | \\( Y\_3 \gets R\_6 R\_7 \\) | \\( T\_3 \gets R\_2 R\_6 \\) | \\( Z\_3 \gets R\_1 R\_7 \\) |
|
||
|
|
|
||
|
|
and the unified addition algorithm is presented as follows:
|
||
|
|
|
||
|
|
| Cost | Processor 1 | Processor 2 | Processor 3 | Processor 4 |
|
||
|
|
|------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------------|
|
||
|
|
| | \\( R\_1 \gets Y\_1 - X\_1 \\) | \\( R\_2 \gets Y\_2 - X\_2 \\) | \\( R\_3 \gets Y\_1 + X\_1 \\) | \\( R\_4 \gets Y\_2 + X\_2 \\) |
|
||
|
|
| \\(1\mathbf M\\) | \\( R\_5 \gets R\_1 R\_2 \\) | \\( R\_6 \gets R\_3 R\_4 \\) | \\( R\_7 \gets T\_1 T\_2 \\) | \\( R\_8 \gets Z\_1 Z\_2 \\) |
|
||
|
|
| \\(1\mathbf D\\) | idle | idle | \\( R\_7 \gets k R\_7 \\) | \\( R\_8 \gets 2 R\_8 \\) |
|
||
|
|
| | \\( R\_1 \gets R\_6 - R\_5 \\) | \\( R\_2 \gets R\_8 - R\_7 \\) | \\( R\_3 \gets R\_8 + R\_7 \\) | \\( R\_4 \gets R\_6 + R\_5 \\) |
|
||
|
|
| \\(1\mathbf M\\) | \\( X\_3 \gets R\_1 R\_2 \\) | \\( Y\_3 \gets R\_3 R\_4 \\) | \\( T\_3 \gets R\_1 R\_4 \\) | \\( Z\_3 \gets R\_2 R\_3 \\) |
|
||
|
|
|
||
|
|
Here \\(\mathbf M\\) and \\(\mathbf S\\) represent the cost of
|
||
|
|
multiplication and squaring of generic field elements, \\(\mathbf D\\)
|
||
|
|
represents the cost of multiplication by a curve constant (in this
|
||
|
|
case \\( k = 2d \\)).
|
||
|
|
|
||
|
|
Notice that the \\(1\mathbf M\\) and \\(1\mathbf S\\) steps are
|
||
|
|
uniform. The non-uniform steps are all inexpensive additions or
|
||
|
|
subtractions, with the exception of the multiplication by the curve
|
||
|
|
constant \\(k = 2d\\):
|
||
|
|
$$
|
||
|
|
R\_7 \gets 2 d R\_7.
|
||
|
|
$$
|
||
|
|
|
||
|
|
HWCD suggest parallelising this step by breaking \\(k = 2d\\) into four
|
||
|
|
parts as \\(k = k_0 + 2\^n k_1 + 2\^{2n} k_2 + 2\^{3n} k_3 \\) and
|
||
|
|
computing \\(k_i R_7 \\) in parallel. This is quite awkward, but if
|
||
|
|
the curve constant is a ratio \\( d = d\_1/d\_2 \\), then projective
|
||
|
|
coordinates allow us to instead compute
|
||
|
|
$$
|
||
|
|
(R\_5, R\_6, R\_7, R\_8) \gets (d\_2 R\_5, d\_2 R\_6, 2d\_1 R\_7, d\_2 R\_8).
|
||
|
|
$$
|
||
|
|
This can be performed as a uniform multiplication by a vector of
|
||
|
|
constants, and if \\(d\_1, d\_2\\) are small, it is relatively
|
||
|
|
inexpensive. (This trick was suggested by Mike Hamburg).
|
||
|
|
In the Curve25519 case, we have
|
||
|
|
$$
|
||
|
|
d = \frac{d\_1}{d\_2} = \frac{-121665}{121666};
|
||
|
|
$$
|
||
|
|
Since \\(2 \cdot 121666 < 2\^{18}\\), all the constants above fit (up
|
||
|
|
to sign) in 32 bits, so this can be done in parallel as four
|
||
|
|
multiplications by small constants \\( (121666, 121666, 2\cdot 121665,
|
||
|
|
2\cdot 121666) \\), followed by a negation to compute \\( - 2\cdot 121665\\).
|
||
|
|
|
||
|
|
# Modified parallel formulas
|
||
|
|
|
||
|
|
Using the modifications sketched above, we can write SIMD-friendly
|
||
|
|
versions of the parallel formulas as follows. To avoid confusion with
|
||
|
|
the original formulas, temporary variables are named \\(S\\) instead
|
||
|
|
of \\(R\\) and are in static single-assignment form.
|
||
|
|
|
||
|
|
## Addition
|
||
|
|
|
||
|
|
To add points
|
||
|
|
\\(P_1 = (X_1 : Y_1 : Z_1 : T_1) \\)
|
||
|
|
and
|
||
|
|
\\(P_2 = (X_2 : Y_2 : Z_2 : T_2 ) \\),
|
||
|
|
we compute
|
||
|
|
$$
|
||
|
|
\begin{aligned}
|
||
|
|
(S\_0 &&,&& S\_1 &&,&& S\_2 &&,&& S\_3 )
|
||
|
|
&\gets
|
||
|
|
(Y\_1 - X\_1&&,&& Y\_1 + X\_1&&,&& Y\_2 - X\_2&&,&& Y\_2 + X\_2)
|
||
|
|
\\\\
|
||
|
|
(S\_4 &&,&& S\_5 &&,&& S\_6 &&,&& S\_7 )
|
||
|
|
&\gets
|
||
|
|
(S\_0 \cdot S\_2&&,&& S\_1 \cdot S\_3&&,&& Z\_1 \cdot Z\_2&&,&& T\_1 \cdot T\_2)
|
||
|
|
\\\\
|
||
|
|
(S\_8 &&,&& S\_9 &&,&& S\_{10} &&,&& S\_{11} )
|
||
|
|
&\gets
|
||
|
|
(d\_2 \cdot S\_4 &&,&& d\_2 \cdot S\_5 &&,&& 2 d\_2 \cdot S\_6 &&,&& 2 d\_1 \cdot S\_7 )
|
||
|
|
\\\\
|
||
|
|
(S\_{12} &&,&& S\_{13} &&,&& S\_{14} &&,&& S\_{15})
|
||
|
|
&\gets
|
||
|
|
(S\_9 - S\_8&&,&& S\_9 + S\_8&&,&& S\_{10} - S\_{11}&&,&& S\_{10} + S\_{11})
|
||
|
|
\\\\
|
||
|
|
(X\_3&&,&& Y\_3&&,&& Z\_3&&,&& T\_3)
|
||
|
|
&\gets
|
||
|
|
(S\_{12} \cdot S\_{14}&&,&& S\_{15} \cdot S\_{13}&&,&& S\_{15} \cdot S\_{14}&&,&& S\_{12} \cdot S\_{13})
|
||
|
|
\end{aligned}
|
||
|
|
$$
|
||
|
|
to obtain \\( P\_3 = (X\_3 : Y\_3 : Z\_3 : T\_3) = P\_1 + P\_2 \\).
|
||
|
|
This costs \\( 2\mathbf M + 1 \mathbf D\\).
|
||
|
|
|
||
|
|
## Readdition
|
||
|
|
|
||
|
|
If the point \\( P\_2 = (X\_2 : Y\_2 : Z\_2 : T\_2) \\) is fixed, we
|
||
|
|
can cache the multiplication of the curve constants by computing
|
||
|
|
$$
|
||
|
|
\begin{aligned}
|
||
|
|
(S\_2\' &&,&& S\_3\' &&,&& Z\_2\' &&,&& T\_2\' )
|
||
|
|
&\gets
|
||
|
|
(d\_2 \cdot (Y\_2 - X\_2)&&,&& d\_2 \cdot (Y\_1 + X\_1)&&,&& 2d\_2 \cdot Z\_2 &&,&& 2d\_1 \cdot T\_2).
|
||
|
|
\end{aligned}
|
||
|
|
$$
|
||
|
|
This costs \\( 1\mathbf D\\); with \\( (S\_2\', S\_3\', Z\_2\', T\_2\')\\)
|
||
|
|
in hand, the addition formulas above become
|
||
|
|
$$
|
||
|
|
\begin{aligned}
|
||
|
|
(S\_0 &&,&& S\_1 &&,&& Z\_1 &&,&& T\_1 )
|
||
|
|
&\gets
|
||
|
|
(Y\_1 - X\_1&&,&& Y\_1 + X\_1&&,&& Z\_1 &&,&& T\_1)
|
||
|
|
\\\\
|
||
|
|
(S\_8 &&,&& S\_9 &&,&& S\_{10} &&,&& S\_{11} )
|
||
|
|
&\gets
|
||
|
|
(S\_0 \cdot S\_2\' &&,&& S\_1 \cdot S\_3\'&&,&& Z\_1 \cdot Z\_2\' &&,&& T\_1 \cdot T\_2\')
|
||
|
|
\\\\
|
||
|
|
(S\_{12} &&,&& S\_{13} &&,&& S\_{14} &&,&& S\_{15})
|
||
|
|
&\gets
|
||
|
|
(S\_9 - S\_8&&,&& S\_9 + S\_8&&,&& S\_{10} - S\_{11}&&,&& S\_{10} + S\_{11})
|
||
|
|
\\\\
|
||
|
|
(X\_3&&,&& Y\_3&&,&& Z\_3&&,&& T\_3)
|
||
|
|
&\gets
|
||
|
|
(S\_{12} \cdot S\_{14}&&,&& S\_{15} \cdot S\_{13}&&,&& S\_{15} \cdot S\_{14}&&,&& S\_{12} \cdot S\_{13})
|
||
|
|
\end{aligned}
|
||
|
|
$$
|
||
|
|
which costs only \\( 2\mathbf M \\). This precomputation is
|
||
|
|
essentially similar to the precomputation that HWCD suggest for their
|
||
|
|
serial formulas. Because the cost of precomputation and then
|
||
|
|
readdition is the same as addition, it's sufficient to only
|
||
|
|
implement caching and readdition.
|
||
|
|
|
||
|
|
## Doubling
|
||
|
|
|
||
|
|
The non-uniform portions of the (re)addition formulas have a fairly
|
||
|
|
regular structure. Unfortunately, this is not the case for the
|
||
|
|
doubling formulas, which are much less nice.
|
||
|
|
|
||
|
|
To double a point \\( P = (X\_1 : Y\_1 : Z\_1 : T\_1) \\), we compute
|
||
|
|
$$
|
||
|
|
\begin{aligned}
|
||
|
|
(X\_1 &&,&& Y\_1 &&,&& Z\_1 &&,&& S\_0)
|
||
|
|
&\gets
|
||
|
|
(X\_1 &&,&& Y\_1 &&,&& Z\_1 &&,&& X\_1 + Y\_1)
|
||
|
|
\\\\
|
||
|
|
(S\_1 &&,&& S\_2 &&,&& S\_3 &&,&& S\_4 )
|
||
|
|
&\gets
|
||
|
|
(X\_1\^2 &&,&& Y\_1\^2&&,&& Z\_1\^2 &&,&& S\_0\^2)
|
||
|
|
\\\\
|
||
|
|
(S\_5 &&,&& S\_6 &&,&& S\_8 &&,&& S\_9 )
|
||
|
|
&\gets
|
||
|
|
(S\_1 + S\_2 &&,&& S\_1 - S\_2 &&,&& S\_1 + 2S\_3 - S\_2 &&,&& S\_1 + S\_2 - S\_4)
|
||
|
|
\\\\
|
||
|
|
(X\_3 &&,&& Y\_3 &&,&& Z\_3 &&,&& T\_3 )
|
||
|
|
&\gets
|
||
|
|
(S\_8 \cdot S\_9 &&,&& S\_5 \cdot S\_6 &&,&& S\_8 \cdot S\_6 &&,&& S\_5 \cdot S\_9)
|
||
|
|
\end{aligned}
|
||
|
|
$$
|
||
|
|
to obtain \\( P\_3 = (X\_3 : Y\_3 : Z\_3 : T\_3) = \[2\]P\_1 \\).
|
||
|
|
|
||
|
|
The intermediate step between the squaring and multiplication requires
|
||
|
|
a long chain of additions. For the IFMA-based implementation, this is not a problem; for the AVX2-based implementation, it is, but with some care and finesse, it's possible to arrange the computation without requiring an intermediate reduction.
|
||
|
|
|
||
|
|
# Implementation
|
||
|
|
|
||
|
|
These formulas aren't specific to a particular representation of field
|
||
|
|
element vectors, whose optimum choice is determined by the details of
|
||
|
|
the instruction set. However, it's not possible to perfectly separate
|
||
|
|
the implementation of the field element vectors from the
|
||
|
|
implementation of the point operations. Instead, the [`avx2`] and
|
||
|
|
[`ifma`] backends provide `ExtendedPoint` and `CachedPoint` types, and
|
||
|
|
the [`scalar_mul`] code uses one of the backend types by a type alias.
|
||
|
|
|
||
|
|
# Comparison to non-vectorized formulas
|
||
|
|
|
||
|
|
In theory, the parallel Edwards formulas seem to allow a \\(4\\)-way
|
||
|
|
speedup from parallelism. However, an actual vectorized
|
||
|
|
implementation has several slowdowns that cut into this speedup.
|
||
|
|
|
||
|
|
First, the parallel formulas can only use the available vector
|
||
|
|
multiplier. For AVX2, this is a \\( 32 \times 32 \rightarrow 64
|
||
|
|
\\)-bit integer multiplier, so the speedup from vectorization must
|
||
|
|
overcome the disadvantage of losing the \\( 64 \times 64 \rightarrow
|
||
|
|
128\\)-bit (serial) integer multiplier. The effect of this slowdown
|
||
|
|
is microarchitecture-dependent, since it requires accounting for the
|
||
|
|
total number of multiplications and additions and their relative
|
||
|
|
costs. IFMA allows using a \\( 52 \times 52 \rightarrow 104 \\)-bit
|
||
|
|
multiplier, but the high and low halves need to be computed
|
||
|
|
separately, and the reduction requires extra work because it's not
|
||
|
|
possible to pre-multiply by \\(19\\).
|
||
|
|
|
||
|
|
Second, the parallel doubling formulas incur both a theoretical and
|
||
|
|
practical slowdown. The parallel formulas described above work on the
|
||
|
|
\\( \mathbb P\^3 \\) “extended” coordinates. The \\( \mathbb P\^2 \\)
|
||
|
|
model introduced earlier by [Bernstein, Birkner, Joye, Lange, and
|
||
|
|
Peters][bbjlp08] allows slightly faster doublings, so HWCD suggest
|
||
|
|
mixing coordinate systems while performing scalar multiplication
|
||
|
|
(attributing the idea to [a 1998 paper][cmo98] by Cohen, Miyagi, and
|
||
|
|
Ono). The \\( T \\) coordinate is not required for doublings, so when
|
||
|
|
doublings are followed by doublings, its computation can be skipped.
|
||
|
|
More details on this approach and the different coordinate systems can
|
||
|
|
be found in the [`curve_models` module documentation][curve_models].
|
||
|
|
|
||
|
|
Unfortunately, this optimization is not compatible with the parallel
|
||
|
|
formulas, which cannot save time by skipping a single variable, so the
|
||
|
|
parallel doubling formulas do slightly more work when counting the
|
||
|
|
total number of field multiplications and squarings.
|
||
|
|
|
||
|
|
In addition, the parallel doubling formulas have a less regular
|
||
|
|
pattern of additions and subtractions than the parallel addition
|
||
|
|
formulas, so the vectorization overhead is proportionately greater.
|
||
|
|
Both the parallel addition and parallel doubling formulas also require
|
||
|
|
some shuffling to rearrange data within the vectors, which places more
|
||
|
|
pressure on the shuffle unit than is desirable.
|
||
|
|
|
||
|
|
This means that the speedup from using a vectorized implementation of
|
||
|
|
parallel Edwards formulas is likely to be greatest in applications
|
||
|
|
that do fewer doublings and more additions (like a large multiscalar
|
||
|
|
multiplication) rather than applications that do fewer additions and
|
||
|
|
more doublings (like a double-base scalar multiplication).
|
||
|
|
|
||
|
|
Third, Amdahl's law says that the speedup is limited to the portion
|
||
|
|
which can be parallelized. Normally, the field multiplications
|
||
|
|
dominate the cost of point operations, but with the IFMA backend, the
|
||
|
|
multiplications are so fast that the non-parallel additions end up as
|
||
|
|
a significant portion of the total time.
|
||
|
|
|
||
|
|
Fourth, current Intel CPUs perform thermal throttling when using wide
|
||
|
|
vector instructions. A detailed description can be found in §15.26 of
|
||
|
|
[the Intel Optimization Manual][intel], but using wide vector
|
||
|
|
instructions prevents the core from operating at higher frequencies.
|
||
|
|
The core can return to the higher-frequency state after 2
|
||
|
|
milliseconds, but this timer is reset every time high-power
|
||
|
|
instructions are used.
|
||
|
|
|
||
|
|
Any speedup from vectorization therefore has to be weighed against a
|
||
|
|
slowdown for the next few million instructions. For a mixed workload,
|
||
|
|
where point operations are interspersed with other tasks, this can
|
||
|
|
reduce overall performance. This implementation is therefore probably
|
||
|
|
not suitable for basic applications, like signatures, but is
|
||
|
|
worthwhile for complex applications, like zero-knowledge proofs, which
|
||
|
|
do sustained work.
|
||
|
|
|
||
|
|
# Future work
|
||
|
|
|
||
|
|
There are several directions for future improvement:
|
||
|
|
|
||
|
|
* Using the vectorized field arithmetic code to parallelize across
|
||
|
|
point operations rather than within a single point operation. This
|
||
|
|
is less flexible, but would give a speedup both from allowing use of
|
||
|
|
the faster mixed-model arithmetic and from reducing shuffle
|
||
|
|
pressure. One approach in this direction would be to implement
|
||
|
|
batched scalar-point operations using vectors of points (AoSoA
|
||
|
|
layout). This less generally useful but would give a speedup for
|
||
|
|
Bulletproofs.
|
||
|
|
|
||
|
|
* Extending the IFMA implementation to use the full width of AVX512,
|
||
|
|
either handling the extra parallelism internally to a single point
|
||
|
|
operation (by using a 2-way parallel implementation of field
|
||
|
|
arithmetic instead of a wordsliced one), or externally,
|
||
|
|
parallelizing across point operations. Internal parallelism would
|
||
|
|
be preferable but might require too much shuffle pressure. For now,
|
||
|
|
the only available CPU which runs IFMA operations executes them at
|
||
|
|
256-bits wide anyways, so this isn't yet important.
|
||
|
|
|
||
|
|
* Generalizing the implementation to NEON instructions. The current
|
||
|
|
point arithmetic code is written in terms of field element vectors,
|
||
|
|
which are in turn implemented using platform SIMD vectors. It
|
||
|
|
should be possible to write an alternate implementation of the
|
||
|
|
`FieldElement2625x4` using NEON without changing the point
|
||
|
|
arithmetic. NEON has 128-bit vectors rather than 256-bit vectors,
|
||
|
|
but this may still be worthwhile compared to a serial
|
||
|
|
implementation.
|
||
|
|
|
||
|
|
|
||
|
|
[sandy2x]: https://eprint.iacr.org/2015/943.pdf
|
||
|
|
[avx2trac]: https://trac.torproject.org/projects/tor/ticket/8897#comment:28
|
||
|
|
[hwcd08]: https://www.iacr.org/archive/asiacrypt2008/53500329/53500329.pdf
|
||
|
|
[curve_models]: https://docs.rs/curve25519-dalek/latest/curve25519-dalek/backend/serial/curve_models/index.html
|
||
|
|
[bbjlp08]: https://eprint.iacr.org/2008/013
|
||
|
|
[cmo98]: https://link.springer.com/content/pdf/10.1007%2F3-540-49649-1_6.pdf
|
||
|
|
[intel]: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
|