cli/vendor/curve25519-dalek/docs/avx2-notes.md

An AVX2 implementation of the vectorized point operation strategy.

# Field element representation

Our strategy is to implement 4-wide multiplication and squaring by
wordslicing, using one 64-bit AVX2 lane for each field element.  Field
elements are represented in the usual way as 10 `u32` limbs in radix
\\(25.5\\) (i.e., alternating between \\(2\^{26}\\) for even limbs and
\\(2\^{25}\\) for odd limbs).  This has the effect that passing between
the parallel 32-bit AVX2 representation and the serial 64-bit
representation (which uses radix \\(2^{51}\\)) amounts to regrouping
digits.

The field element representation is oriented around the AVX2
`vpmuludq` instruction, which multiplies the low 32 bits of each
64-bit lane of each operand to produce a 64-bit result.

```text,no_run
(a1 ?? b1 ?? c1 ?? d1 ??)
(a2 ?? b2 ?? c2 ?? d2 ??)

(a1*a2 b1*b2 c1*c2 d1*d2)
```

To unpack 32-bit values into 64-bit lanes for use in multiplication
it would be convenient to use the `vpunpck[lh]dq` instructions,
which unpack and interleave the low and high 32-bit lanes of two
source vectors.
However, the AVX2 versions of these instructions are designed to
operate only within 128-bit lanes of the 256-bit vectors, so that
interleaving the low lanes of `(a0 b0 c0 d0 a1 b1 c1 d1)` with zero
gives `(a0 00 b0 00 a1 00 b1 00)`.  Instead, we pre-shuffle the data
layout as `(a0 b0 a1 b1 c0 d0 c1 d1)` so that we can unpack the
"low" and "high" parts as

```text,no_run
(a0 00 b0 00 c0 00 d0 00)
(a1 00 b1 00 c1 00 d1 00)
```

The data layout for a vector of four field elements \\( (a,b,c,d)
\\) with limbs \\( a_0, a_1, \ldots, a_9 \\) is as `[u32x8; 5]` in
the form

```text,no_run
(a0 b0 a1 b1 c0 d0 c1 d1)
(a2 b2 a3 b3 c2 d2 c3 d3)
(a4 b4 a5 b5 c4 d4 c5 d5)
(a6 b6 a7 b7 c6 d6 c7 d7)
(a8 b8 a9 b9 c8 d8 c9 d9)
```

Since this breaks cleanly into two 128-bit lanes, it may be possible
to adapt it to 128-bit vector instructions such as NEON without too
much difficulty.

# Avoiding Overflow in Doubling

To analyze the size of the field element coefficients during the
computations, we can parameterize the bounds on the limbs of each
field element by \\( b \in \mathbb R \\) representing the excess bits
above that limb's radix, so that each limb is bounded by either
\\(2\^{25+b} \\) or \\( 2\^{26+b} \\), as appropriate.

The multiplication routine requires that its inputs are bounded with
\\( b < 1.75 \\), in order to fit a multiplication by \\( 19 \\)
into 32 bits.  Since \\( \lg 19 < 4.25 \\), \\( 19x < 2\^{32} \\)
when \\( x < 2\^{27.75} = 2\^{26 + 1.75} \\).  However, this is only
required for one of the inputs; the other can grow up to \\( b < 2.5
\\).

In addition, the multiplication and squaring routines do not
canonically reduce their outputs, but can leave some small uncarried
excesses, so that their reduced outputs are bounded with
\\( b < 0.007 \\).

The non-parallel portion of the doubling formulas is
$$
\begin{aligned}
(S\_5 &&,&& S\_6 &&,&& S\_8 &&,&& S\_9 )
&\gets
(S\_1 + S\_2 &&,&& S\_1 - S\_2 &&,&& S\_1 + 2S\_3 - S\_2 &&,&& S\_1 + S\_2 - S\_4)
\end{aligned}
$$

Computing \\( (S\_5, S\_6, S\_8, S\_9 ) \\) as
$$
\begin{matrix}
 & S\_1 & S\_1 & S\_1 & S\_1 \\\\
+& S\_2 &      &      & S\_2 \\\\
+&      &      & S\_3 &      \\\\
+&      &      & S\_3 &      \\\\
+&      & 2p   & 2p   & 2p   \\\\
-&      & S\_2 & S\_2 &      \\\\
-&      &      &      & S\_4 \\\\
=& S\_5 & S\_6 & S\_8 & S\_9
\end{matrix}
$$
results in bit-excesses \\( < (1.01, 1.60, 2.33, 2.01)\\) for
\\( (S\_5, S\_6, S\_8, S\_9 ) \\).  The products we want to compute
are then
$$
\begin{aligned}
X\_3 &\gets S\_8 S\_9 \leftrightarrow (2.33, 2.01) \\\\
Y\_3 &\gets S\_5 S\_6 \leftrightarrow (1.01, 1.60) \\\\
Z\_3 &\gets S\_8 S\_6 \leftrightarrow (2.33, 1.60) \\\\
T\_3 &\gets S\_5 S\_9 \leftrightarrow (1.01, 2.01)
\end{aligned}
$$
which are too large: it's not possible to arrange the multiplicands so
that one vector has \\(b < 2.5\\) and the other has \\( b < 1.75 \\).
However, if we flip the sign of \\( S\_4 = S\_0\^2 \\) during
squaring, so that we output \\(S\_4' = -S\_4 \pmod p\\), then we can
compute
$$
\begin{matrix}
 & S\_1 & S\_1 & S\_1 & S\_1 \\\\
+& S\_2 &      &      & S\_2 \\\\
+&      &      & S\_3 &      \\\\
+&      &      & S\_3 &      \\\\
+&      &      &      & S\_4' \\\\
+&      & 2p   & 2p   &      \\\\
-&      & S\_2 & S\_2 &      \\\\
=& S\_5 & S\_6 & S\_8 & S\_9
\end{matrix}
$$
resulting in bit-excesses \\( < (1.01, 1.60, 2.33, 1.60)\\) for
\\( (S\_5, S\_6, S\_8, S\_9 ) \\).  The products we want to compute
are then
$$
\begin{aligned}
X\_3 &\gets S\_8 S\_9 \leftrightarrow (2.33, 1.60) \\\\
Y\_3 &\gets S\_5 S\_6 \leftrightarrow (1.01, 1.60) \\\\
Z\_3 &\gets S\_8 S\_6 \leftrightarrow (2.33, 1.60) \\\\
T\_3 &\gets S\_5 S\_9 \leftrightarrow (1.01, 1.60)
\end{aligned}
$$
whose right-hand sides are all bounded with \\( b < 1.75 \\) and
whose left-hand sides are all bounded with \\( b < 2.5 \\),
so that we can avoid any intermediate reductions.