chore: checkpoint before Python removal

2026-03-26 22:33:59 +00:00
parent 683cec9307
commit e568ddf82a
29972 changed files with 11269302 additions and 2 deletions


@@ -0,0 +1,12 @@
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous">
<!-- The loading of KaTeX is deferred to speed up page rendering -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script>
<!-- To automatically render math in text elements, include the auto-render extension: -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous"
onload="renderMathInElement(document.body);"></script>
<style>
.katex { font-size: 1em !important; }
</style>


@@ -0,0 +1,140 @@
An AVX2 implementation of the vectorized point operation strategy.
# Field element representation
Our strategy is to implement 4-wide multiplication and squaring by
wordslicing, using one 64-bit AVX2 lane for each field element. Field
elements are represented in the usual way as 10 `u32` limbs in radix
\\(25.5\\) (i.e., alternating between \\(2\^{26}\\) for even limbs and
\\(2\^{25}\\) for odd limbs). This has the effect that passing between
the parallel 32-bit AVX2 representation and the serial 64-bit
representation (which uses radix \\(2^{51}\\)) amounts to regrouping
digits.
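Concretely, regrouping digits amounts to pairing each even (26-bit) limb with the odd (25-bit) limb above it. A scalar sketch of this conversion (an illustrative helper with an assumed name, not the crate's code):

```rust
// Regroup ten radix-25.5 limbs (alternating 26/25 bits) into five
// radix-2^51 limbs: limb 2i sits at bits [0, 26) of the 51-bit digit,
// limb 2i+1 at bits [26, 51).
fn to_radix_51(a: &[u32; 10]) -> [u64; 5] {
    let mut out = [0u64; 5];
    for i in 0..5 {
        out[i] = (a[2 * i] as u64) + ((a[2 * i + 1] as u64) << 26);
    }
    out
}
```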
The field element representation is oriented around the AVX2
`vpmuludq` instruction, which multiplies the low 32 bits of each
64-bit lane of each operand to produce a 64-bit result.
```text,no_run
(a1 ?? b1 ?? c1 ?? d1 ??)
(a2 ?? b2 ?? c2 ?? d2 ??)
(a1*a2 b1*b2 c1*c2 d1*d2)
```
To unpack 32-bit values into 64-bit lanes for use in multiplication
it would be convenient to use the `vpunpck[lh]dq` instructions,
which unpack and interleave the low and high 32-bit lanes of two
source vectors.
However, the AVX2 versions of these instructions are designed to
operate only within 128-bit lanes of the 256-bit vectors, so that
interleaving the low lanes of `(a0 b0 c0 d0 a1 b1 c1 d1)` with zero
gives `(a0 00 b0 00 a1 00 b1 00)`. Instead, we pre-shuffle the data
layout as `(a0 b0 a1 b1 c0 d0 c1 d1)` so that we can unpack the
"low" and "high" parts as
```text,no_run
(a0 00 b0 00 c0 00 d0 00)
(a1 00 b1 00 c1 00 d1 00)
```
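In scalar terms, the in-lane unpack against zero behaves like the following sketch (an illustrative model of the shuffle, not the vectorized code):

```rust
// Model of vpunpck[lh]dq against zero on the pre-shuffled layout
// (a0 b0 a1 b1 | c0 d0 c1 d1): each 128-bit half interleaves its
// low (resp. high) two 32-bit lanes with zeros.
fn unpack(v: [u32; 8]) -> ([u32; 8], [u32; 8]) {
    let lo = [v[0], 0, v[1], 0, v[4], 0, v[5], 0]; // (a0 00 b0 00 c0 00 d0 00)
    let hi = [v[2], 0, v[3], 0, v[6], 0, v[7], 0]; // (a1 00 b1 00 c1 00 d1 00)
    (lo, hi)
}
```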
The data layout for a vector of four field elements \\( (a,b,c,d)
\\) with limbs \\( a_0, a_1, \ldots, a_9 \\) is as `[u32x8; 5]` in
the form
```text,no_run
(a0 b0 a1 b1 c0 d0 c1 d1)
(a2 b2 a3 b3 c2 d2 c3 d3)
(a4 b4 a5 b5 c4 d4 c5 d5)
(a6 b6 a7 b7 c6 d6 c7 d7)
(a8 b8 a9 b9 c8 d8 c9 d9)
```
Since this breaks cleanly into two 128-bit lanes, it may be possible
to adapt it to 128-bit vector instructions such as NEON without too
much difficulty.
# Avoiding Overflow in Doubling
To analyze the size of the field element coefficients during the
computations, we can parameterize the bounds on the limbs of each
field element by \\( b \in \mathbb R \\) representing the excess bits
above that limb's radix, so that each limb is bounded by either
\\(2\^{25+b} \\) or \\( 2\^{26+b} \\), as appropriate.
The multiplication routine requires that its inputs are bounded with
\\( b < 1.75 \\), in order to fit a multiplication by \\( 19 \\)
into 32 bits. Since \\( \lg 19 < 4.25 \\), \\( 19x < 2\^{32} \\)
when \\( x < 2\^{27.75} = 2\^{26 + 1.75} \\). However, this is only
required for one of the inputs; the other can grow up to \\( b < 2.5
\\).
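The bound can be checked numerically (an illustrative check, not crate code):

```rust
// A limb bounded by 2^(26 + b) still fits in 32 bits after
// multiplication by 19 exactly when lg(19) + 26 + b < 32.
fn fits_after_mul_by_19(bits: f64) -> bool {
    19f64.log2() + bits < 32.0
}
```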
In addition, the multiplication and squaring routines do not
canonically reduce their outputs, but can leave some small uncarried
excesses, so that their reduced outputs are bounded with
\\( b < 0.007 \\).
The non-parallel portion of the doubling formulas is
$$
\begin{aligned}
(S\_5 &&,&& S\_6 &&,&& S\_8 &&,&& S\_9 )
&\gets
(S\_1 + S\_2 &&,&& S\_1 - S\_2 &&,&& S\_1 + 2S\_3 - S\_2 &&,&& S\_1 + S\_2 - S\_4)
\end{aligned}
$$
Computing \\( (S\_5, S\_6, S\_8, S\_9 ) \\) as
$$
\begin{matrix}
& S\_1 & S\_1 & S\_1 & S\_1 \\\\
+& S\_2 & & & S\_2 \\\\
+& & & S\_3 & \\\\
+& & & S\_3 & \\\\
+& & 2p & 2p & 2p \\\\
-& & S\_2 & S\_2 & \\\\
-& & & & S\_4 \\\\
=& S\_5 & S\_6 & S\_8 & S\_9
\end{matrix}
$$
results in bit-excesses \\( < (1.01, 1.60, 2.33, 2.01)\\) for
\\( (S\_5, S\_6, S\_8, S\_9 ) \\). The products we want to compute
are then
$$
\begin{aligned}
X\_3 &\gets S\_8 S\_9 \leftrightarrow (2.33, 2.01) \\\\
Y\_3 &\gets S\_5 S\_6 \leftrightarrow (1.01, 1.60) \\\\
Z\_3 &\gets S\_8 S\_6 \leftrightarrow (2.33, 1.60) \\\\
T\_3 &\gets S\_5 S\_9 \leftrightarrow (1.01, 2.01)
\end{aligned}
$$
which are too large: it's not possible to arrange the multiplicands so
that one vector has \\(b < 2.5\\) and the other has \\( b < 1.75 \\).
However, if we flip the sign of \\( S\_4 = S\_0\^2 \\) during
squaring, so that we output \\(S\_4' = -S\_4 \pmod p\\), then we can
compute
$$
\begin{matrix}
& S\_1 & S\_1 & S\_1 & S\_1 \\\\
+& S\_2 & & & S\_2 \\\\
+& & & S\_3 & \\\\
+& & & S\_3 & \\\\
+& & & & S\_4' \\\\
+& & 2p & 2p & \\\\
-& & S\_2 & S\_2 & \\\\
=& S\_5 & S\_6 & S\_8 & S\_9
\end{matrix}
$$
resulting in bit-excesses \\( < (1.01, 1.60, 2.33, 1.60)\\) for
\\( (S\_5, S\_6, S\_8, S\_9 ) \\). The products we want to compute
are then
$$
\begin{aligned}
X\_3 &\gets S\_8 S\_9 \leftrightarrow (2.33, 1.60) \\\\
Y\_3 &\gets S\_5 S\_6 \leftrightarrow (1.01, 1.60) \\\\
Z\_3 &\gets S\_8 S\_6 \leftrightarrow (2.33, 1.60) \\\\
T\_3 &\gets S\_5 S\_9 \leftrightarrow (1.01, 1.60)
\end{aligned}
$$
whose right-hand sides are all bounded with \\( b < 1.75 \\) and
whose left-hand sides are all bounded with \\( b < 2.5 \\),
so that we can avoid any intermediate reductions.


@@ -0,0 +1,580 @@
An AVX512-IFMA implementation of the vectorized point operation
strategy.
# IFMA instructions
AVX512-IFMA is an extension to AVX-512 consisting of two instructions:
* `vpmadd52luq`: packed multiply of unsigned 52-bit integers and add
the low 52 product bits to 64-bit accumulators;
* `vpmadd52huq`: packed multiply of unsigned 52-bit integers and add
the high 52 product bits to 64-bit accumulators.
These operate on 64-bit lanes of their source vectors, taking the low
52 bits of each lane of each source vector, computing the 104-bit
products of each pair, and then adding either the high or low 52 bits
of the 104-bit products to the 64-bit lanes of the destination vector.
The multiplication is performed internally by reusing circuitry for
floating-point arithmetic. Although these instructions are part of
AVX512, the AVX512VL (vector length) extension (present whenever IFMA
is) allows using them with 512, 256, or 128-bit operands.
This provides a major advantage to vectorized integer operations:
previously, vector operations could only use a \\(32 \times 32
\rightarrow 64\\)-bit multiplier, while serial code could use a
\\(64\times 64 \rightarrow 128\\)-bit multiplier.
## IFMA for big-integer multiplications
A detailed example of the intended use of the IFMA instructions can be
found in a 2016 paper by Gueron and Krasnov, [_Accelerating Big
Integer Arithmetic Using Intel IFMA Extensions_][2016_gueron_krasnov].
The basic idea is that multiplication of large integers (such as 1024,
2048, or more bits) can be performed as follows.
First, convert a “packed” 64-bit representation
\\[
\begin{aligned}
x &= x'_0 + x'_1 2^{64} + x'_2 2^{128} + \cdots \\\\
y &= y'_0 + y'_1 2^{64} + y'_2 2^{128} + \cdots
\end{aligned}
\\]
into a “redundant” 52-bit representation
\\[
\begin{aligned}
x &= x_0 + x_1 2^{52} + x_2 2^{104} + \cdots \\\\
y &= y_0 + y_1 2^{52} + y_2 2^{104} + \cdots
\end{aligned}
\\]
with each \\(x_i, y_j\\) in a 64-bit lane.
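The repacking step can be sketched in scalar code (a two-word example with an assumed name, not the paper's implementation):

```rust
// Convert a little-endian pair of packed 64-bit words into three
// redundant 52-bit limbs, one per (eventual) 64-bit lane.
fn to_radix_52(x: &[u64; 2]) -> [u64; 3] {
    const MASK: u64 = (1 << 52) - 1;
    [
        x[0] & MASK,                          // bits 0..52
        ((x[0] >> 52) | (x[1] << 12)) & MASK, // bits 52..104
        x[1] >> 40,                           // bits 104..128
    ]
}
```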
Writing the product as \\(z = z_0 + z_1 2^{52} + z_2 2^{104} + \cdots\\),
the “schoolbook” multiplication strategy gives
\\[
\begin{aligned}
&z_0 &&=& x_0 & y_0 & & & & & & & & \\\\
&z_1 &&=& x_1 & y_0 &+ x_0 & y_1 & & & & & & \\\\
&z_2 &&=& x_2 & y_0 &+ x_1 & y_1 &+ x_0 & y_2 & & & & \\\\
&z_3 &&=& x_3 & y_0 &+ x_2 & y_1 &+ x_1 & y_2 &+ x_0 & y_3 & & \\\\
&z_4 &&=& \vdots\\;&\\;\vdots &+ x_3 & y_1 &+ x_2 & y_2 &+ x_1 & y_3 &+ \cdots& \\\\
&z_5 &&=& & & \vdots\\;&\\;\vdots &+ x_3 & y_2 &+ x_2 & y_3 &+ \cdots& \\\\
&z_6 &&=& & & & & \vdots\\;&\\;\vdots &+ x_3 & y_3 &+ \cdots& \\\\
&z_7 &&=& & & & & & & \vdots\\;&\\;\vdots &+ \cdots& \\\\
&\vdots&&=& & & & & & & & & \ddots& \\\\
\end{aligned}
\\]
Notice that the product coefficient \\(z_k\\), representing the value
\\(z_k 2^{52k}\\), is the sum of all product terms
\\(
(x_i 2^{52 i}) (y_j 2^{52 j})
\\)
with \\(k = i + j\\).
Write the IFMA operators \\(\mathrm{lo}(a,b)\\), denoting the low
\\(52\\) bits of \\(ab\\), and
\\(\mathrm{hi}(a,b)\\), denoting the high \\(52\\) bits of
\\(ab\\).
Now we can rewrite the product terms as
\\[
\begin{aligned}
(x_i 2^{52 i}) (y_j 2^{52 j})
&=
2^{52 (i+j)}(
\mathrm{lo}(x_i, y_j) +
\mathrm{hi}(x_i, y_j) 2^{52}
)
\\\\
&=
\mathrm{lo}(x_i, y_j) 2^{52 (i+j)} +
\mathrm{hi}(x_i, y_j) 2^{52 (i+j+1)}.
\end{aligned}
\\]
This means that the low half of \\(x_i y_j\\) can be accumulated onto
the product limb \\(z_{i+j}\\) and the high half can be directly
accumulated onto the next-higher product limb \\(z_{i+j+1}\\) with no
additional operations. This allows rewriting the schoolbook
multiplication into the form
\\[
\begin{aligned}
&z_0 &&=& \mathrm{lo}(x_0,&y_0) & & & & & & & & & & \\\\
&z_1 &&=& \mathrm{lo}(x_1,&y_0) &+\mathrm{hi}(x_0,&y_0) &+\mathrm{lo}(x_0,&y_1) & & & & & & \\\\
&z_2 &&=& \mathrm{lo}(x_2,&y_0) &+\mathrm{hi}(x_1,&y_0) &+\mathrm{lo}(x_1,&y_1) &+\mathrm{hi}(x_0,&y_1) &+\mathrm{lo}(x_0,&y_2) & & \\\\
&z_3 &&=& \mathrm{lo}(x_3,&y_0) &+\mathrm{hi}(x_2,&y_0) &+\mathrm{lo}(x_2,&y_1) &+\mathrm{hi}(x_1,&y_1) &+\mathrm{lo}(x_1,&y_2) &+ \cdots& \\\\
&z_4 &&=& \vdots\\;&\\;\vdots &+\mathrm{hi}(x_3,&y_0) &+\mathrm{lo}(x_3,&y_1) &+\mathrm{hi}(x_2,&y_1) &+\mathrm{lo}(x_2,&y_2) &+ \cdots& \\\\
&z_5 &&=& & & \vdots\\;&\\;\vdots & \vdots\\;&\\;\vdots &+\mathrm{hi}(x_3,&y_1) &+\mathrm{lo}(x_3,&y_2) &+ \cdots& \\\\
&z_6 &&=& & & & & & & \vdots\\;&\\;\vdots & \vdots\\;&\\;\vdots &+ \cdots& \\\\
&\vdots&&=& & & & & & & & & & & \ddots& \\\\
\end{aligned}
\\]
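A scalar model of the two instructions and of the accumulation pattern above might look like this (illustrative only; the real instructions operate on vector lanes):

```rust
const MASK52: u64 = (1 << 52) - 1;

// Model of vpmadd52luq on one lane: accumulate the low 52 bits of the
// 104-bit product of the low 52 bits of a and b.
fn madd52lo(z: u64, a: u64, b: u64) -> u64 {
    let p = (a & MASK52) as u128 * (b & MASK52) as u128;
    z + ((p as u64) & MASK52)
}

// Model of vpmadd52huq: accumulate the high 52 bits of the product.
fn madd52hi(z: u64, a: u64, b: u64) -> u64 {
    let p = (a & MASK52) as u128 * (b & MASK52) as u128;
    z + ((p >> 52) as u64)
}

// Schoolbook product of two 2-limb radix-2^52 integers: lo(x_i, y_j)
// accumulates onto z_{i+j}, hi(x_i, y_j) directly onto z_{i+j+1}.
fn mul2(x: [u64; 2], y: [u64; 2]) -> [u64; 4] {
    let mut z = [0u64; 4];
    for i in 0..2 {
        for j in 0..2 {
            z[i + j] = madd52lo(z[i + j], x[i], y[j]);
            z[i + j + 1] = madd52hi(z[i + j + 1], x[i], y[j]);
        }
    }
    z
}
```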
Gueron and Krasnov implement multiplication by constructing vectors
out of the columns of this diagram, so that the source operands for
the IFMA instructions are of the form \\((x_0, x_1, x_2, \ldots)\\)
and \\((y_i, y_i, y_i, \ldots)\\).
After performing the multiplication,
the product terms \\(z_i\\) are then repacked into a 64-bit representation.
## An alternative strategy
The strategy described above is aimed at big-integer multiplications,
such as 1024, 2048, or 4096 bits, which would be used for applications
like RSA. However, elliptic curve cryptography uses much smaller field
sizes, such as 256 or 384 bits, so a different strategy is needed.
The parallel Edwards formulas provide parallelism at the level of the
formulas for curve operations. This means that instead of scanning
through the terms of the source operands and parallelizing *within* a
field element (as described above), we can arrange the computation in
product-scanning form and parallelize *across* field elements (as
described below).
The parallel Edwards
formulas provide 4-way parallelism, so they can be implemented using
256-bit vectors using a single 64-bit lane for each element, or using
512-bit vectors using two 64-bit lanes.
The only available CPU supporting IFMA (the
i3-8121U) executes 512-bit IFMA instructions at half rate compared to
256-bit instructions, so for now there's no throughput advantage to
using 512-bit IFMA instructions, and this implementation uses 256-bit
vectors.
To extend this to 512-bit vectors, it's only necessary to achieve
2-way parallelism, and it's possible (with a small amount of overhead)
to create a hybrid strategy that operates entirely within 128-bit
lanes. This means that cross-lane operations can use the faster
`vpshufd` (1c latency) instead of a general shuffle instruction (3c
latency).
# Choice of radix
The inputs to IFMA instructions are 52 bits wide, so the radix \\(r\\)
used to represent a multiprecision integer must be \\( r \leq 52 \\).
The obvious choice is the "native" radix \\(r = 52\\).
As described above, this choice
has the advantage that for \\(x_i, y_j \in [0,2^{52})\\), the product term
\\[
\begin{aligned}
(x_i 2^{52 i}) (y_j 2^{52 j})
&=
2^{52 (i+j)}(
\mathrm{lo}(x_i, y_j) +
\mathrm{hi}(x_i, y_j) 2^{52}
)
\\\\
&=
\mathrm{lo}(x_i, y_j) 2^{52 (i+j)} +
\mathrm{hi}(x_i, y_j) 2^{52 (i+j+1)},
\end{aligned}
\\]
so that the low and high halves of the product can be directly accumulated
onto the product limbs.
In contrast, when using a smaller radix \\(r = 52 - k\\),
the product term has the form
\\[
\begin{aligned}
(x_i 2^{r i}) (y_j 2^{r j})
&=
2^{r (i+j)}(
\mathrm{lo}(x_i, y_j) +
\mathrm{hi}(x_i, y_j) 2^{52}
)
\\\\
&=
\mathrm{lo}(x_i, y_j) 2^{r (i+j)} +
(
\mathrm{hi}(x_i, y_j) 2^k
)
2^{r (i+j+1)}.
\end{aligned}
\\]
What's happening is that the product \\(x_i y_j\\) of size \\(2r\\)
bits is split not at \\(r\\) but at \\(52\\), so \\(k\\) product bits
are placed into the low half instead of the high half. This means
that the high half of the product cannot be directly accumulated onto
\\(z_{i+j+1}\\), but must first be multiplied by \\(2^k\\) (i.e., left
shifted by \\(k\\)). In addition, the low half of the product is
\\(52\\) bits large instead of \\(r\\) bits.
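For example, with \\(r = 51\\) (so \\(k = 1\\)), the split can be modeled in scalar code (illustrative only):

```rust
const MASK52: u64 = (1 << 52) - 1;

// The product of two radix-2^51 limbs still splits at bit 52, so when
// re-expressed in radix 2^51 the high half carries a factor 2^k = 2.
fn lo_hi_52(x: u64, y: u64) -> (u64, u64) {
    let p = (x & MASK52) as u128 * (y & MASK52) as u128;
    ((p as u64) & MASK52, (p >> 52) as u64)
}
```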
## Handling offset product terms
[Drucker and Gueron][2018_drucker_gueron] analyze the choice of radix
in the context of big-integer squaring, outlining three ways to handle
the offset product terms, before concluding that all of them are
suboptimal:
1. Shift the results after accumulation;
2. Shift the input operands before multiplication;
3. Split the MAC operation, accumulating into a zeroed register,
shifting the result, and then adding.
The first option is rejected because it could double-shift some
previously accumulated terms, the second doesn't work because the
inputs could become larger than \\(52\\) bits, and the third requires
additional instructions to handle the shifting and adding.
Based on an analysis of the total number of instructions, they suggest an
addition to the instruction set, which they call `FMSA` (fused
multiply-shift-add). This would shift the result according to an 8-bit
immediate value before accumulating it into the destination register.
However, this change to the instruction set doesn't seem to be
necessary. Instead, the product terms can be grouped according to
their coefficients, accumulated together, then shifted once before
adding them to the final sum. This uses an extra register, shift, and
add, but only once per product term (accumulation target), not once
per source term (as in the Drucker-Gueron paper).
Moreover, because IFMA instructions execute only on two ports
(presumably 0 and 1), while adds and shifts can execute on three ports
(0, 1, and 5), the adds and shifts can execute independently of the
IFMA operations, as long as there is not too much pressure on port 5.
This means that, although the total number of instructions increases,
the shifts and adds do not necessarily increase the execution time, as
long as throughput is limited by IFMA operations.
Finally, because IFMA instructions have 4 cycle latency and 0.5/1
cycle throughput (for 256/512 bit vectors), maximizing IFMA throughput
requires either 8 (for 256) or 4 (for 512) independent operations. So
accumulating groups of terms independently before adding them at the
end may be necessary anyways, in order to prevent long chains of
dependent instructions.
## Advantages of a smaller radix
Using a smaller radix has other advantages. Although radix \\(52\\)
is an unsaturated representation from the point of view of the
\\(64\\)-bit accumulators (because up to 4096 product terms can be
accumulated without carries), it's a saturated representation from the
point of view of the multiplier (since \\(52\\)-bit values are the
maximum input size).
Because the inputs to a multiplication must have all of their limbs
bounded by \\(2^{52}\\), limbs in excess of \\(2^{52}\\) must be
reduced before they can be used as an input. The
[Gueron-Krasnov][2016_gueron_krasnov] paper suggests normalizing
values using a standard, sequential carry chain: for each limb, add
the carryin from reducing the previous limb, compute the carryout and
reduce the current limb, then move to the next limb.
However, when using a smaller radix, such as \\(51\\), each limb can
store a carry bit and still be used as the input to a multiplication.
This means that the inputs do not need to be normalized, and instead
of using a sequential carry chain, we can compute all carryouts in
parallel, reduce all limbs in parallel, and then add the carryins in
parallel (possibly growing the limb values by one bit).
Because the output of this partial reduction is an acceptable
multiplication input, we can "close the loop" using partial reductions
and never have to normalize to a canonical representation through the
entire computation, in contrast to the Gueron-Krasnov approach, which
converts back to a packed representation after every operation. (This
idea seems to trace back to at least as early as [this 1999
paper][1999_walter]).
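A scalar sketch of this parallel partial reduction for radix-\\(2^{51}\\) limbs (illustrative, with an assumed name; the real code operates on vector lanes):

```rust
// Partial reduction mod p = 2^255 - 19: compute all carry-outs at
// once, mask all limbs at once, then add the carry-ins, with the top
// carry wrapping around multiplied by 19 (since 2^255 = 19 mod p).
fn partial_reduce(mut z: [u64; 5]) -> [u64; 5] {
    const LOW_51: u64 = (1 << 51) - 1;
    // All carry-outs in parallel: no sequential dependency between limbs.
    let c = [z[0] >> 51, z[1] >> 51, z[2] >> 51, z[3] >> 51, z[4] >> 51];
    for limb in z.iter_mut() {
        *limb &= LOW_51;
    }
    // Carry-ins in parallel; each limb grows by at most one bit.
    z[0] += 19 * c[4];
    z[1] += c[0];
    z[2] += c[1];
    z[3] += c[2];
    z[4] += c[3];
    z
}
```

The outputs are not canonical, but every limb fits in \\(52\\) bits, so they are acceptable multiplication inputs.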
Using \\(r = 51\\) is enough to keep a carry bit in each limb and
avoid normalizations. What about an even smaller radix? One reason
to choose a smaller radix would be to align the limb boundaries with
an inline reduction (for instance, choosing \\(r = 43\\) for the
Mersenne field \\(p = 2^{127} - 1\\)), but for \\(p = 2^{255} - 19\\),
\\(r = 51 = 255/5\\) is the natural choice.
# Multiplication
The inputs to a multiplication are two field elements
\\[
\begin{aligned}
x &= x_0 + x_1 2^{51} + x_2 2^{102} + x_3 2^{153} + x_4 2^{204} \\\\
y &= y_0 + y_1 2^{51} + y_2 2^{102} + y_3 2^{153} + y_4 2^{204},
\end{aligned}
\\]
with limbs in range \\([0,2^{52})\\).
Writing the product terms as
\\[
\begin{aligned}
z &= z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204} \\\\
&+ z_5 2^{255} + z_6 2^{306} + z_7 2^{357} + z_8 2^{408} + z_9 2^{459},
\end{aligned}
\\]
a schoolbook multiplication in product scanning form takes the form
\\[
\begin{aligned}
z_0 &= x_0 y_0 \\\\
z_1 &= x_1 y_0 + x_0 y_1 \\\\
z_2 &= x_2 y_0 + x_1 y_1 + x_0 y_2 \\\\
z_3 &= x_3 y_0 + x_2 y_1 + x_1 y_2 + x_0 y_3 \\\\
z_4 &= x_4 y_0 + x_3 y_1 + x_2 y_2 + x_1 y_3 + x_0 y_4 \\\\
z_5 &= x_4 y_1 + x_3 y_2 + x_2 y_3 + x_1 y_4 \\\\
z_6 &= x_4 y_2 + x_3 y_3 + x_2 y_4 \\\\
z_7 &= x_4 y_3 + x_3 y_4 \\\\
z_8 &= x_4 y_4 \\\\
z_9 &= 0 \\\\
\end{aligned}
\\]
Each term \\(x_i y_j\\) can be written in terms of IFMA operations as
\\[
x_i y_j = \mathrm{lo}(x_i,y_j) + 2\mathrm{hi}(x_i,y_j)2^{51}.
\\]
Substituting this equation into the schoolbook multiplication, then
moving terms to eliminate the \\(2^{51}\\) factors gives
\\[
\begin{aligned}
z_0 &= \mathrm{lo}(x_0, y_0) \\\\
&+ \qquad 0 \\\\
z_1 &= \mathrm{lo}(x_1, y_0) + \mathrm{lo}(x_0, y_1) \\\\
&+ \qquad 2( \mathrm{hi}(x_0, y_0) )\\\\
z_2 &= \mathrm{lo}(x_2, y_0) + \mathrm{lo}(x_1, y_1) + \mathrm{lo}(x_0, y_2) \\\\
&+ \qquad 2( \mathrm{hi}(x_1, y_0) + \mathrm{hi}(x_0, y_1) )\\\\
z_3 &= \mathrm{lo}(x_3, y_0) + \mathrm{lo}(x_2, y_1) + \mathrm{lo}(x_1, y_2) + \mathrm{lo}(x_0, y_3) \\\\
&+ \qquad 2( \mathrm{hi}(x_2, y_0) + \mathrm{hi}(x_1, y_1) + \mathrm{hi}(x_0, y_2) )\\\\
z_4 &= \mathrm{lo}(x_4, y_0) + \mathrm{lo}(x_3, y_1) + \mathrm{lo}(x_2, y_2) + \mathrm{lo}(x_1, y_3) + \mathrm{lo}(x_0, y_4) \\\\
&+ \qquad 2( \mathrm{hi}(x_3, y_0) + \mathrm{hi}(x_2, y_1) + \mathrm{hi}(x_1, y_2) + \mathrm{hi}(x_0, y_3) )\\\\
z_5 &= \mathrm{lo}(x_4, y_1) + \mathrm{lo}(x_3, y_2) + \mathrm{lo}(x_2, y_3) + \mathrm{lo}(x_1, y_4) \\\\
&+ \qquad 2( \mathrm{hi}(x_4, y_0) + \mathrm{hi}(x_3, y_1) + \mathrm{hi}(x_2, y_2) + \mathrm{hi}(x_1, y_3) + \mathrm{hi}(x_0, y_4) )\\\\
z_6 &= \mathrm{lo}(x_4, y_2) + \mathrm{lo}(x_3, y_3) + \mathrm{lo}(x_2, y_4) \\\\
&+ \qquad 2( \mathrm{hi}(x_4, y_1) + \mathrm{hi}(x_3, y_2) + \mathrm{hi}(x_2, y_3) + \mathrm{hi}(x_1, y_4) )\\\\
z_7 &= \mathrm{lo}(x_4, y_3) + \mathrm{lo}(x_3, y_4) \\\\
&+ \qquad 2( \mathrm{hi}(x_4, y_2) + \mathrm{hi}(x_3, y_3) + \mathrm{hi}(x_2, y_4) )\\\\
z_8 &= \mathrm{lo}(x_4, y_4) \\\\
&+ \qquad 2( \mathrm{hi}(x_4, y_3) + \mathrm{hi}(x_3, y_4) )\\\\
z_9 &= 0 \\\\
&+ \qquad 2( \mathrm{hi}(x_4, y_4) )\\\\
\end{aligned}
\\]
As noted above, our strategy will be to multiply and accumulate the
terms with coefficient \\(2\\) separately from those with coefficient
\\(1\\), before combining them at the end. This can alternately be
thought of as accumulating product terms into a *doubly-redundant*
representation, with two limbs for each digit, before collapsing
the doubly-redundant representation by shifts and adds.
This computation requires 25 `vpmadd52luq` and 25 `vpmadd52huq`
operations. For 256-bit vectors, IFMA operations execute on an
i3-8121U with latency 4 cycles, throughput 0.5 cycles, so executing 50
instructions requires 25 cycles' worth of throughput. Accumulating
terms with coefficient \\(1\\) and \\(2\\) separately means that the
longest dependency chain has length 5, so the critical path has length
20 cycles and the bottleneck is throughput.
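The doubly-redundant accumulation can be modeled in scalar code (an illustrative sketch with assumed names; the real code accumulates in IFMA lanes):

```rust
const MASK52: u64 = (1 << 52) - 1;

fn lo52(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) as u64) & MASK52
}

fn hi52(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) >> 52) as u64
}

// Terms with coefficient 1 and coefficient 2 go into separate
// accumulators, collapsed at the end by a single shift-and-add per digit.
fn mul_terms(x: [u64; 5], y: [u64; 5]) -> [u64; 10] {
    let mut ones = [0u64; 10];
    let mut twos = [0u64; 10];
    for i in 0..5 {
        for j in 0..5 {
            ones[i + j] += lo52(x[i], y[j]);
            twos[i + j + 1] += hi52(x[i], y[j]);
        }
    }
    let mut z = [0u64; 10];
    for k in 0..10 {
        z[k] = ones[k] + 2 * twos[k]; // collapse the doubly-redundant form
    }
    z
}

// Value of radix-2^51 limbs, taken mod 2^64 (limbs at index >= 2
// contribute nothing mod 2^64, since 51 * 2 > 64).
fn val_mod64(limbs: &[u64]) -> u64 {
    let mut acc = 0u64;
    for (i, &l) in limbs.iter().enumerate() {
        let sh = 51 * i;
        if sh < 64 {
            acc = acc.wrapping_add(l << sh);
        }
    }
    acc
}
```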
# Reduction modulo \\(p\\)
The next question is how to handle the reduction modulo \\(p\\).
Because \\(p = 2^{255} - 19\\), \\(2^{255} = 19 \pmod p\\), so we can
alternately write
\\[
\begin{aligned}
z &= z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204} \\\\
&+ z_5 2^{255} + z_6 2^{306} + z_7 2^{357} + z_8 2^{408} + z_9 2^{459}
\end{aligned}
\\]
as
\\[
\begin{aligned}
z &= (z_0 + 19z_5) + (z_1 + 19z_6) 2^{51} + (z_2 + 19z_7) 2^{102} + (z_3 + 19z_8) 2^{153} + (z_4 + 19z_9) 2^{204}.
\end{aligned}
\\]
When using a \\(64 \times 64 \rightarrow 128\\)-bit multiplier, this
can be handled (as in [Ed25519][ed25519_paper]) by premultiplying
source terms by \\(19\\). Since \\(\lg(19) < 4.25\\), this increases
their size by less than \\(4.25\\) bits, and the rest of the
multiplication can be shown to work out.
Here, we have at most \\(1\\) bit of headroom. In order to allow
premultiplication, we would need to use radix \\(2^{47}\\), which
would require six limbs instead of five. Instead, we compute the high
terms \\(z_5, \ldots, z_9\\), each using two chains of IFMA
operations, then multiply by \\(19\\) and combine with the lower terms
\\(z_0, \ldots, z_4\\). There are two ways to perform the
multiplication by \\(19\\): using more IFMA operations, or using the
`vpmullq` instruction, which computes the low \\(64\\) bits of a \\(64
\times 64\\)-bit product. However, `vpmullq` has 15c/1.5c
latency/throughput, in contrast to the 4c/0.5c latency/throughput of
IFMA operations, so it seems like a worse choice.
The high terms \\(z_5, \ldots, z_9\\) are sums of \\(52\\)-bit terms,
so they are larger than \\(52\\) bits. Write these terms in radix \\(52\\) as
\\[
z_{5+i} = z_{5+i}' + z_{5+i}'' 2^{52}, \qquad z_{5+i}' < 2^{52}.
\\]
Then the contribution of \\(z_{5+i}\\), taken modulo \\(p\\), is
\\[
\begin{aligned}
z_{5+i} 2^{255} 2^{51 i}
&=
19 (z_{5+i}' + z_{5+i}'' 2^{52}) 2^{51 i}
\\\\
&=
19 z_{5+i}' 2^{51 i} + 2 \cdot 19 z_{5+i}'' 2^{51 (i+1)}
\\\\
\end{aligned}
\\]
The products \\(19 z_{5+i}', 19 z_{5+i}''\\) can be written in terms of IFMA operations as
\\[
\begin{aligned}
19 z_{5+i}' &= \mathrm{lo}(19, z_{5+i}') + 2 \mathrm{hi}(19, z_{5+i}') 2^{51}, \\\\
19 z_{5+i}'' &= \mathrm{lo}(19, z_{5+i}'') + 2 \mathrm{hi}(19, z_{5+i}'') 2^{51}. \\\\
\end{aligned}
\\]
Because \\(z_{5+i} < 2^{64}\\), \\(z_{5+i}'' < 2^{12} \\), so \\(19
z_{5+i}'' < 2^{17} < 2^{52} \\) and \\(\mathrm{hi}(19, z_{5+i}'') = 0\\).
Because IFMA operations ignore the high bits of their source
operands, we do not need to compute \\(z\_{5+i}'\\) explicitly:
the high bits will be ignored.
Combining these observations, we can write
\\[
\begin{aligned}
z_{5+i} 2^{255} 2^{51 i}
&=
19 z_{5+i}' 2^{51 i} + 2 \cdot 19 z_{5+i}'' 2^{51 (i+1)}
\\\\
&=
\mathrm{lo}(19, z_{5+i}) 2^{51 i}
\+ 2 \mathrm{hi}(19, z_{5+i}) 2^{51 (i+1)}
\+ 2 \mathrm{lo}(19, z_{5+i}/2^{52}) 2^{51 (i+1)}.
\end{aligned}
\\]
For \\(i = 0,1,2,3\\), this allows reducing \\(z_{5+i}\\) onto
\\(z_{i}, z_{i+1}\\), and if the low terms are computed using a
doubly-redundant representation, no additional shifts are needed to
handle the \\(2\\) coefficients. For \\(i = 4\\), there's a
complication: the contribution becomes
\\[
\begin{aligned}
z_{9} 2^{255} 2^{204}
&=
\mathrm{lo}(19, z_{9}) 2^{204}
\+ 2 \mathrm{hi}(19, z_{9}) 2^{255}
\+ 2 \mathrm{lo}(19, z_{9}/2^{52}) 2^{255}
\\\\
&=
\mathrm{lo}(19, z_{9}) 2^{204}
\+ 2 \mathrm{hi}(19, z_{9}) 19
\+ 2 \mathrm{lo}(19, z_{9}/2^{52}) 19
\\\\
&=
\mathrm{lo}(19, z_{9}) 2^{204}
\+ 2
\mathrm{lo}(19, \mathrm{hi}(19, z_{9}) + \mathrm{lo}(19, z_{9}/2^{52})).
\\\\
\end{aligned}
\\]
It would be possible to cut the number of multiplications from 3 to 2
by carrying the high part of each \\(z_i\\) onto \\(z_{i+1}\\). This
would eliminate 5 multiplications, clearing 2.5 cycles of port
pressure, at the cost of 5 additions, adding 1.66 cycles of port
pressure. But doing this would create a dependency between terms
(e.g., \\(z_{5}\\) must be computed before the reduction of
\\(z_{6}\\) can begin), whereas with the approach above, all
contributions to all terms are computed independently, to maximize ILP
and flexibility for the processor to schedule instructions.
This strategy performs 16 IFMA operations, adding two IFMA operations
to each of the \\(2\\)-coefficient terms and one to each of the
\\(1\\)-coefficient terms. Considering the multiplication and
reduction together, we use 66 IFMA operations, requiring 33 cycles'
throughput, while the longest chain of IFMA operations is in the
reduction of \\(z_5\\) onto \\(z_1\\), of length 7 (28 cycles, plus
2 cycles to combine the two parts of \\(z_5\\)), so the bottleneck is
again throughput.
Once this is done, we have computed the product terms
\\[
z = z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204},
\\]
without reducing the \\(z_i\\) to fit in \\(52\\) bits. Because the
overall flow of operations alternates multiplications and additions or
subtractions, we would have to perform a reduction after an addition
but before the next multiplication anyways, so there's no benefit to
fully reducing the limbs at the end of a multiplication. Instead, we
leave them unreduced, and track the reduction state using the type
system to ensure that unreduced limbs are not accidentally used as an
input to a multiplication.
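A hypothetical sketch of this type-level tracking (names invented for illustration; the actual types differ):

```rust
use std::marker::PhantomData;

struct Reduced;
struct Unreduced;

// The reduction state is a type parameter, so passing an unreduced
// element where a reduced one is required is a compile error, not a
// silent overflow.
struct FieldElement<S> {
    limbs: [u64; 5],
    _state: PhantomData<S>,
}

impl FieldElement<Unreduced> {
    /// Parallel partial reduction; output limbs fit in 52 bits.
    fn reduce(self) -> FieldElement<Reduced> {
        // ... carry computation elided ...
        FieldElement { limbs: self.limbs, _state: PhantomData }
    }
}

impl FieldElement<Reduced> {
    /// Multiplication takes reduced inputs, returns an unreduced output.
    fn mul(&self, other: &FieldElement<Reduced>) -> FieldElement<Unreduced> {
        // ... IFMA product elided ...
        let _ = other;
        FieldElement { limbs: self.limbs, _state: PhantomData }
    }
}
```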
# Squaring
Squaring operates similarly to multiplication, but with the
possibility to combine identical terms.
As before, we write the input as
\\[
\begin{aligned}
x &= x_0 + x_1 2^{51} + x_2 2^{102} + x_3 2^{153} + x_4 2^{204}
\end{aligned}
\\]
with limbs in range \\([0,2^{52})\\).
Writing the product terms as
\\[
\begin{aligned}
z &= z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204} \\\\
&+ z_5 2^{255} + z_6 2^{306} + z_7 2^{357} + z_8 2^{408} + z_9 2^{459},
\end{aligned}
\\]
a schoolbook squaring in product scanning form takes the form
\\[
\begin{aligned}
z_0 &= x_0 x_0 \\\\
z_1 &= 2 x_1 x_0 \\\\
z_2 &= 2 x_2 x_0 + x_1 x_1 \\\\
z_3 &= 2 x_3 x_0 + 2 x_2 x_1 \\\\
z_4 &= 2 x_4 x_0 + 2 x_3 x_1 + x_2 x_2 \\\\
z_5 &= 2 x_4 x_1 + 2 x_3 x_2 \\\\
z_6 &= 2 x_4 x_2 + x_3 x_3 \\\\
z_7 &= 2 x_4 x_3 \\\\
z_8 &= x_4 x_4 \\\\
z_9 &= 0 \\\\
\end{aligned}
\\]
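The term counts above can be checked against the general schoolbook product with \\(y = x\\) (scalar illustration with small limbs, so no lo/hi splitting is needed):

```rust
// The squaring formulas above, written out directly: combined terms
// x_i x_j with i != j appear once with coefficient 2.
fn square_terms(x: [u64; 5]) -> [u64; 10] {
    let mut z = [0u64; 10];
    z[0] = x[0] * x[0];
    z[1] = 2 * x[1] * x[0];
    z[2] = 2 * x[2] * x[0] + x[1] * x[1];
    z[3] = 2 * x[3] * x[0] + 2 * x[2] * x[1];
    z[4] = 2 * x[4] * x[0] + 2 * x[3] * x[1] + x[2] * x[2];
    z[5] = 2 * x[4] * x[1] + 2 * x[3] * x[2];
    z[6] = 2 * x[4] * x[2] + x[3] * x[3];
    z[7] = 2 * x[4] * x[3];
    z[8] = x[4] * x[4];
    z[9] = 0;
    z
}
```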
As before, we write \\(x_i x_j\\) as
\\[
x_i x_j = \mathrm{lo}(x_i,x_j) + 2\mathrm{hi}(x_i,x_j)2^{51},
\\]
and substitute to obtain
\\[
\begin{aligned}
z_0 &= \mathrm{lo}(x_0, x_0) + 0 \\\\
z_1 &= 2 \mathrm{lo}(x_1, x_0) + 2 \mathrm{hi}(x_0, x_0) \\\\
z_2 &= 2 \mathrm{lo}(x_2, x_0) + \mathrm{lo}(x_1, x_1) + 4 \mathrm{hi}(x_1, x_0) \\\\
z_3 &= 2 \mathrm{lo}(x_3, x_0) + 2 \mathrm{lo}(x_2, x_1) + 4 \mathrm{hi}(x_2, x_0) + 2 \mathrm{hi}(x_1, x_1) \\\\
z_4 &= 2 \mathrm{lo}(x_4, x_0) + 2 \mathrm{lo}(x_3, x_1) + \mathrm{lo}(x_2, x_2) + 4 \mathrm{hi}(x_3, x_0) + 4 \mathrm{hi}(x_2, x_1) \\\\
z_5 &= 2 \mathrm{lo}(x_4, x_1) + 2 \mathrm{lo}(x_3, x_2) + 4 \mathrm{hi}(x_4, x_0) + 4 \mathrm{hi}(x_3, x_1) + 2 \mathrm{hi}(x_2, x_2) \\\\
z_6 &= 2 \mathrm{lo}(x_4, x_2) + \mathrm{lo}(x_3, x_3) + 4 \mathrm{hi}(x_4, x_1) + 4 \mathrm{hi}(x_3, x_2) \\\\
z_7 &= 2 \mathrm{lo}(x_4, x_3) + 4 \mathrm{hi}(x_4, x_2) + 2 \mathrm{hi}(x_3, x_3) \\\\
z_8 &= \mathrm{lo}(x_4, x_4) + 4 \mathrm{hi}(x_4, x_3) \\\\
z_9 &= 0 + 2 \mathrm{hi}(x_4, x_4) \\\\
\end{aligned}
\\]
To implement these, we group terms by their coefficient, computing
those with coefficient \\(2\\) on one set of IFMA chains, while on
another set of chains we begin with the coefficient-\\(4\\) terms,
then shift left before continuing with the coefficient-\\(1\\) terms.
The reduction strategy is the same as for multiplication.
# Future improvements
LLVM won't use blend operations on [256-bit vectors yet][llvm_blend],
so there are a number of blend instructions that could be omitted.
Although the multiplications and squarings are much faster, there's no
speedup to the additions and subtractions, so there are diminishing
returns. In fact, the complications in the doubling formulas mean
that doubling is actually slower than readdition. This also suggests
that moving to 512-bit vectors won't be much help for a strategy aimed
at parallelism within a group operation, so to extract performance
gains from 512-bit vectors it will probably be necessary to create a
parallel-friendly multiscalar multiplication algorithm. This could
also help with reducing shuffle pressure.
The squaring implementation could probably be optimized, but without
`perf` support on Cannonlake it's difficult to make actual
measurements.
Another improvement would be to implement vectorized square root
computations, which would allow creating an iterator adaptor for point
decompression that bunched decompression operations and executed them
in parallel. This would accelerate batch verification.
[2016_gueron_krasnov]: https://ieeexplore.ieee.org/document/7563269
[2018_drucker_gueron]: https://eprint.iacr.org/2018/335
[1999_walter]: https://pdfs.semanticscholar.org/0e6a/3e8f30b63b556679f5dff2cbfdfe9523f4fa.pdf
[ed25519_paper]: https://ed25519.cr.yp.to/ed25519-20110926.pdf
[llvm_blend]: https://bugs.llvm.org/show_bug.cgi?id=38343


@@ -0,0 +1,333 @@
Vectorized implementations of field and point operations, using a
modification of the 4-way parallel formulas of Hisil, Wong, Carter,
and Dawson.
These notes explain the parallel formulas and our strategy for using
them with SIMD operations. There are two backend implementations: one
using AVX2, and the other using AVX512-IFMA.
# Overview
The 2008 paper [_Twisted Edwards Curves Revisited_][hwcd08] by Hisil,
Wong, Carter, and Dawson (HWCD) introduced the “extended coordinates”
and mixed-model representations which are used by most Edwards curve
implementations.
However, they also describe 4-way parallel formulas for point addition
and doubling: a unified addition algorithm taking an effective
\\(2\mathbf M + 1\mathbf D\\), a doubling algorithm taking an
effective \\(1\mathbf M + 1\mathbf S\\), and a dedicated (i.e., for
distinct points) addition algorithm taking an effective \\(2 \mathbf M
\\). They compare these formulas with a 2-way parallel variant of the
Montgomery ladder.
Unlike their serial formulas, which are used widely, their parallel
formulas do not seem to have been implemented in software before. The
2-way parallel Montgomery ladder was used in 2015 by Tung Chou's
`sandy2x` implementation. Curiously, however, although the [`sandy2x`
paper][sandy2x] also implements Edwards arithmetic, and cites HWCD08,
it doesn't mention their parallel Edwards formulas.
A 2015 paper by Hernández and López describes an AVX2 implementation
of X25519. Neither the paper nor the code is publicly available, but
it apparently gives only a [slight speedup][avx2trac], suggesting that
it uses a 4-way parallel Montgomery ladder rather than parallel
Edwards formulas.
The reason may be that HWCD08 describe their formulas as operating on
four independent processors, which would make a software
implementation impractical: all of the operations are too low-latency
to effectively synchronize. But a closer inspection reveals that the
(more expensive) multiplication and squaring steps are uniform, while
the instruction divergence occurs in the (much cheaper) addition and
subtraction steps. This means that a SIMD implementation can perform
the expensive steps uniformly, and handle divergence in the
inexpensive steps using masking.
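As a sketch of this idea (a portable model, not the actual backend code), a divergent add/subtract step can be handled by computing both results for every lane and then selecting per lane with a mask, the way a `vpblendd`-style blend would:

```rust
/// Portable model of masked divergence: every lane pays for both the sum
/// and the difference, then a per-lane mask selects which one is kept,
/// mimicking a vector blend instruction.
fn blend_add_sub(a: [u32; 4], b: [u32; 4], subtract: [bool; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        let sum = a[i].wrapping_add(b[i]);
        let diff = a[i].wrapping_sub(b[i]);
        // In vector code this is a data-dependent select, not a branch.
        out[i] = if subtract[i] { diff } else { sum };
    }
    out
}
```

The expensive multiplications stay uniform across all four lanes; only cheap selects like this differ between the four "processors".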
These notes describe modifications to the original parallel formulas
to allow a SIMD implementation, and this module contains
implementations of the modified formulas targeting either AVX2 or
AVX512-IFMA.
# Parallel formulas in HWCD'08
The doubling formula is presented in the HWCD paper as follows:
| Cost | Processor 1 | Processor 2 | Processor 3 | Processor 4 |
|------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------------|
| | idle | idle | idle | \\( R\_1 \gets X\_1 + Y\_1 \\) |
| \\(1\mathbf S\\) | \\( R\_2 \gets X\_1\^2 \\) | \\( R\_3 \gets Y\_1\^2 \\) | \\( R\_4 \gets Z\_1\^2 \\) | \\( R\_5 \gets R\_1\^2 \\) |
| | \\( R\_6 \gets R\_2 + R\_3 \\) | \\( R\_7 \gets R\_2 - R\_3 \\) | \\( R\_4 \gets 2 R\_4 \\) | idle |
| | idle | \\( R\_1 \gets R\_4 + R\_7 \\) | idle | \\( R\_2 \gets R\_6 - R\_5 \\) |
| \\(1\mathbf M\\) | \\( X\_3 \gets R\_1 R\_2 \\) | \\( Y\_3 \gets R\_6 R\_7 \\) | \\( T\_3 \gets R\_2 R\_6 \\) | \\( Z\_3 \gets R\_1 R\_7 \\) |
and the unified addition algorithm is presented as follows:
| Cost | Processor 1 | Processor 2 | Processor 3 | Processor 4 |
|------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------------|
| | \\( R\_1 \gets Y\_1 - X\_1 \\) | \\( R\_2 \gets Y\_2 - X\_2 \\) | \\( R\_3 \gets Y\_1 + X\_1 \\) | \\( R\_4 \gets Y\_2 + X\_2 \\) |
| \\(1\mathbf M\\) | \\( R\_5 \gets R\_1 R\_2 \\) | \\( R\_6 \gets R\_3 R\_4 \\) | \\( R\_7 \gets T\_1 T\_2 \\) | \\( R\_8 \gets Z\_1 Z\_2 \\) |
| \\(1\mathbf D\\) | idle | idle | \\( R\_7 \gets k R\_7 \\) | \\( R\_8 \gets 2 R\_8 \\) |
| | \\( R\_1 \gets R\_6 - R\_5 \\) | \\( R\_2 \gets R\_8 - R\_7 \\) | \\( R\_3 \gets R\_8 + R\_7 \\) | \\( R\_4 \gets R\_6 + R\_5 \\) |
| \\(1\mathbf M\\) | \\( X\_3 \gets R\_1 R\_2 \\) | \\( Y\_3 \gets R\_3 R\_4 \\) | \\( T\_3 \gets R\_1 R\_4 \\) | \\( Z\_3 \gets R\_2 R\_3 \\) |
Here \\(\mathbf M\\) and \\(\mathbf S\\) represent the cost of
multiplication and squaring of generic field elements, \\(\mathbf D\\)
represents the cost of multiplication by a curve constant (in this
case \\( k = 2d \\)).
Notice that the \\(1\mathbf M\\) and \\(1\mathbf S\\) steps are
uniform. The non-uniform steps are all inexpensive additions or
subtractions, with the exception of the multiplication by the curve
constant \\(k = 2d\\):
$$
R\_7 \gets 2 d R\_7.
$$
HWCD suggest parallelising this step by breaking \\(k = 2d\\) into four
parts as \\(k = k_0 + 2\^n k_1 + 2\^{2n} k_2 + 2\^{3n} k_3 \\) and
computing \\(k_i R_7 \\) in parallel. This is quite awkward, but if
the curve constant is a ratio \\( d = d\_1/d\_2 \\), then projective
coordinates allow us to instead compute
$$
(R\_5, R\_6, R\_7, R\_8) \gets (d\_2 R\_5, d\_2 R\_6, 2d\_1 R\_7, 2d\_2 R\_8).
$$
This can be performed as a uniform multiplication by a vector of
constants, and if \\(d\_1, d\_2\\) are small, it is relatively
inexpensive. (This trick was suggested by Mike Hamburg).
In the Curve25519 case, we have
$$
d = \frac{d\_1}{d\_2} = \frac{-121665}{121666};
$$
Since \\(2 \cdot 121666 < 2\^{18}\\), all the constants above fit (up
to sign) in 32 bits, so this can be done in parallel as four
multiplications by small constants \\( (121666, 121666, 2\cdot 121666,
2\cdot 121665) \\), followed by a negation to compute \\( - 2\cdot 121665\\).
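The bound on the constants is easy to check; a small sanity sketch (the constant name here is ours, not the library's):

```rust
// Lane constants for the multiply-by-curve-constants step, in the
// (d2, d2, 2*d2, 2*d1) order used for the addition formulas; the sign of
// d1 = -121665 is restored by a separate negation.
const LANE_CONSTANTS: [u32; 4] = [121666, 121666, 2 * 121666, 2 * 121665];

/// True iff every lane constant fits in 18 bits, so a product with an
/// unreduced ~26-bit limb stays well inside a 32x32 -> 64-bit multiply.
fn constants_fit_in_18_bits() -> bool {
    LANE_CONSTANTS.iter().all(|&c| c < 1 << 18)
}
```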
# Modified parallel formulas
Using the modifications sketched above, we can write SIMD-friendly
versions of the parallel formulas as follows. To avoid confusion with
the original formulas, temporary variables are named \\(S\\) instead
of \\(R\\) and are in static single-assignment form.
## Addition
To add points
\\(P_1 = (X_1 : Y_1 : Z_1 : T_1) \\)
and
\\(P_2 = (X_2 : Y_2 : Z_2 : T_2 ) \\),
we compute
$$
\begin{aligned}
(S\_0 &&,&& S\_1 &&,&& S\_2 &&,&& S\_3 )
&\gets
(Y\_1 - X\_1&&,&& Y\_1 + X\_1&&,&& Y\_2 - X\_2&&,&& Y\_2 + X\_2)
\\\\
(S\_4 &&,&& S\_5 &&,&& S\_6 &&,&& S\_7 )
&\gets
(S\_0 \cdot S\_2&&,&& S\_1 \cdot S\_3&&,&& Z\_1 \cdot Z\_2&&,&& T\_1 \cdot T\_2)
\\\\
(S\_8 &&,&& S\_9 &&,&& S\_{10} &&,&& S\_{11} )
&\gets
(d\_2 \cdot S\_4 &&,&& d\_2 \cdot S\_5 &&,&& 2 d\_2 \cdot S\_6 &&,&& 2 d\_1 \cdot S\_7 )
\\\\
(S\_{12} &&,&& S\_{13} &&,&& S\_{14} &&,&& S\_{15})
&\gets
(S\_9 - S\_8&&,&& S\_9 + S\_8&&,&& S\_{10} - S\_{11}&&,&& S\_{10} + S\_{11})
\\\\
(X\_3&&,&& Y\_3&&,&& Z\_3&&,&& T\_3)
&\gets
(S\_{12} \cdot S\_{14}&&,&& S\_{15} \cdot S\_{13}&&,&& S\_{15} \cdot S\_{14}&&,&& S\_{12} \cdot S\_{13})
\end{aligned}
$$
to obtain \\( P\_3 = (X\_3 : Y\_3 : Z\_3 : T\_3) = P\_1 + P\_2 \\).
This costs \\( 2\mathbf M + 1 \mathbf D\\).
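As a concrete illustration of the data flow (over a toy prime field, not the real radix-\\(2\^{25.5}\\) arithmetic), the five vector steps can be sketched as follows; `P`, `D1`, and `D2` are hypothetical stand-ins chosen only to exercise the formulas:

```rust
// Toy-field sketch of the five vector steps: P is a stand-in prime (not
// 2^255 - 19) and D1/D2 are hypothetical curve-constant numerator and
// denominator, not the real Curve25519 values.
const P: u64 = 101;
const D1: u64 = 7;
const D2: u64 = 9;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// (X1:Y1:Z1:T1) + (X2:Y2:Z2:T2) via the modified parallel formulas.
fn parallel_add(p1: [u64; 4], p2: [u64; 4]) -> [u64; 4] {
    let [x1, y1, z1, t1] = p1;
    let [x2, y2, z2, t2] = p2;
    // (S0, S1, S2, S3): the divergent but cheap +/- step
    let (s0, s1, s2, s3) = (sub(y1, x1), add(y1, x1), sub(y2, x2), add(y2, x2));
    // (S4, S5, S6, S7): the uniform 1M step
    let (s4, s5, s6, s7) = (mul(s0, s2), mul(s1, s3), mul(z1, z2), mul(t1, t2));
    // (S8, S9, S10, S11): the 1D step, a uniform multiply by small constants
    let (s8, s9) = (mul(D2, s4), mul(D2, s5));
    let (s10, s11) = (mul(mul(2, D2), s6), mul(mul(2, D1), s7));
    // (S12, S13, S14, S15): another divergent +/- step
    let (s12, s13) = (sub(s9, s8), add(s9, s8));
    let (s14, s15) = (sub(s10, s11), add(s10, s11));
    // (X3, Y3, Z3, T3): the final uniform 1M step
    [mul(s12, s14), mul(s15, s13), mul(s15, s14), mul(s12, s13)]
}
```

Each output coordinate is the original HWCD result scaled by \\(2 d\_2\^2 / 2 = d\_2\^2\\), so the projective point is unchanged.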
## Readdition
If the point \\( P\_2 = (X\_2 : Y\_2 : Z\_2 : T\_2) \\) is fixed, we
can cache the multiplication of the curve constants by computing
$$
\begin{aligned}
(S\_2\' &&,&& S\_3\' &&,&& Z\_2\' &&,&& T\_2\' )
&\gets
(d\_2 \cdot (Y\_2 - X\_2)&&,&& d\_2 \cdot (Y\_2 + X\_2)&&,&& 2d\_2 \cdot Z\_2 &&,&& 2d\_1 \cdot T\_2).
\end{aligned}
$$
This costs \\( 1\mathbf D\\); with \\( (S\_2\', S\_3\', Z\_2\', T\_2\')\\)
in hand, the addition formulas above become
$$
\begin{aligned}
(S\_0 &&,&& S\_1 &&,&& Z\_1 &&,&& T\_1 )
&\gets
(Y\_1 - X\_1&&,&& Y\_1 + X\_1&&,&& Z\_1 &&,&& T\_1)
\\\\
(S\_8 &&,&& S\_9 &&,&& S\_{10} &&,&& S\_{11} )
&\gets
(S\_0 \cdot S\_2\' &&,&& S\_1 \cdot S\_3\'&&,&& Z\_1 \cdot Z\_2\' &&,&& T\_1 \cdot T\_2\')
\\\\
(S\_{12} &&,&& S\_{13} &&,&& S\_{14} &&,&& S\_{15})
&\gets
(S\_9 - S\_8&&,&& S\_9 + S\_8&&,&& S\_{10} - S\_{11}&&,&& S\_{10} + S\_{11})
\\\\
(X\_3&&,&& Y\_3&&,&& Z\_3&&,&& T\_3)
&\gets
(S\_{12} \cdot S\_{14}&&,&& S\_{15} \cdot S\_{13}&&,&& S\_{15} \cdot S\_{14}&&,&& S\_{12} \cdot S\_{13})
\end{aligned}
$$
which costs only \\( 2\mathbf M \\). This precomputation is
essentially the same as the precomputation HWCD suggest for their
serial formulas. Because precomputation followed by readdition costs
the same as a single addition, it's sufficient to implement only
caching and readdition.
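Continuing the toy-field sketch (again with hypothetical stand-ins `P`, `D1`, `D2`, not the real constants), caching and readdition look like this:

```rust
// Toy-field sketch of caching and readdition; P, D1, D2 are hypothetical
// stand-ins, not the real curve constants.
const P: u64 = 101;
const D1: u64 = 7;
const D2: u64 = 9;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Fold the curve constants into a fixed point: one 1D step, paid once.
fn cache(p2: [u64; 4]) -> [u64; 4] {
    let [x2, y2, z2, t2] = p2;
    [
        mul(D2, sub(y2, x2)),  // S2'
        mul(D2, add(y2, x2)),  // S3'
        mul(mul(2, D2), z2),   // Z2'
        mul(mul(2, D1), t2),   // T2'
    ]
}

/// Readdition against a cached point: only the two uniform 1M steps
/// and the cheap +/- steps remain.
fn readd(p1: [u64; 4], cached: [u64; 4]) -> [u64; 4] {
    let [x1, y1, z1, t1] = p1;
    let [s2c, s3c, z2c, t2c] = cached;
    let (s0, s1) = (sub(y1, x1), add(y1, x1));
    let (s8, s9, s10, s11) = (mul(s0, s2c), mul(s1, s3c), mul(z1, z2c), mul(t1, t2c));
    let (s12, s13) = (sub(s9, s8), add(s9, s8));
    let (s14, s15) = (sub(s10, s11), add(s10, s11));
    [mul(s12, s14), mul(s15, s13), mul(s15, s14), mul(s12, s13)]
}
```

Readding a cached point produces exactly the same coordinates as the uncached addition, since the 1D step commutes with the following multiplication.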
## Doubling
The non-uniform portions of the (re)addition formulas have a fairly
regular structure. Unfortunately, this is not the case for the
doubling formulas, which are much less nice.
To double a point \\( P\_1 = (X\_1 : Y\_1 : Z\_1 : T\_1) \\), we compute
$$
\begin{aligned}
(X\_1 &&,&& Y\_1 &&,&& Z\_1 &&,&& S\_0)
&\gets
(X\_1 &&,&& Y\_1 &&,&& Z\_1 &&,&& X\_1 + Y\_1)
\\\\
(S\_1 &&,&& S\_2 &&,&& S\_3 &&,&& S\_4 )
&\gets
(X\_1\^2 &&,&& Y\_1\^2&&,&& Z\_1\^2 &&,&& S\_0\^2)
\\\\
(S\_5 &&,&& S\_6 &&,&& S\_8 &&,&& S\_9 )
&\gets
(S\_1 + S\_2 &&,&& S\_1 - S\_2 &&,&& S\_1 + 2S\_3 - S\_2 &&,&& S\_1 + S\_2 - S\_4)
\\\\
(X\_3 &&,&& Y\_3 &&,&& Z\_3 &&,&& T\_3 )
&\gets
(S\_8 \cdot S\_9 &&,&& S\_5 \cdot S\_6 &&,&& S\_8 \cdot S\_6 &&,&& S\_5 \cdot S\_9)
\end{aligned}
$$
to obtain \\( P\_3 = (X\_3 : Y\_3 : Z\_3 : T\_3) = \[2\]P\_1 \\).
The intermediate step between the squaring and multiplication requires
a long chain of additions. For the IFMA-based implementation, this is
not a problem; for the AVX2-based implementation, it is, but with some
care and finesse, it's possible to arrange the computation without
requiring an intermediate reduction.
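The doubling steps can be sketched over the same toy field (`P` is a hypothetical stand-in prime; no curve constants are needed for doubling):

```rust
// Toy-field sketch of the parallel doubling steps; P is a stand-in prime,
// not 2^255 - 19.
const P: u64 = 101;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// [2](X1:Y1:Z1:T1) via the modified parallel doubling formulas.
fn parallel_double(p: [u64; 4]) -> [u64; 4] {
    let [x1, y1, z1, _t1] = p; // T1 is not used by doubling
    let s0 = add(x1, y1);
    // The uniform 1S step
    let (s1, s2, s3, s4) = (mul(x1, x1), mul(y1, y1), mul(z1, z1), mul(s0, s0));
    // The irregular chain of additions and subtractions
    let s5 = add(s1, s2);
    let s6 = sub(s1, s2);
    let s8 = sub(add(s1, add(s3, s3)), s2); // S1 + 2*S3 - S2
    let s9 = sub(add(s1, s2), s4);          // S1 + S2 - S4
    // The uniform 1M step
    [mul(s8, s9), mul(s5, s6), mul(s8, s6), mul(s5, s9)]
}
```

The outputs coincide exactly (no projective rescaling this time) with HWCD's original four-processor schedule from the doubling table above.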
# Implementation
These formulas aren't specific to a particular representation of field
element vectors, whose optimum choice is determined by the details of
the instruction set. However, it's not possible to perfectly separate
the implementation of the field element vectors from the
implementation of the point operations. Instead, the [`avx2`] and
[`ifma`] backends provide `ExtendedPoint` and `CachedPoint` types, and
the [`scalar_mul`] code selects one of the backend types via a type alias.
# Comparison to non-vectorized formulas
In theory, the parallel Edwards formulas seem to allow a \\(4\\)-way
speedup from parallelism. However, an actual vectorized
implementation has several slowdowns that cut into this speedup.
First, the parallel formulas can only use the available vector
multiplier. For AVX2, this is a \\( 32 \times 32 \rightarrow 64
\\)-bit integer multiplier, so the speedup from vectorization must
overcome the disadvantage of losing the \\( 64 \times 64 \rightarrow
128\\)-bit (serial) integer multiplier. The effect of this slowdown
is microarchitecture-dependent, since it requires accounting for the
total number of multiplications and additions and their relative
costs. IFMA allows using a \\( 52 \times 52 \rightarrow 104 \\)-bit
multiplier, but the high and low halves need to be computed
separately, and the reduction requires extra work because it's not
possible to pre-multiply by \\(19\\).
Second, the parallel doubling formulas incur both a theoretical and
practical slowdown. The parallel formulas described above work on the
\\( \mathbb P\^3 \\) “extended” coordinates. The \\( \mathbb P\^2 \\)
model introduced earlier by [Bernstein, Birkner, Joye, Lange, and
Peters][bbjlp08] allows slightly faster doublings, so HWCD suggest
mixing coordinate systems while performing scalar multiplication
(attributing the idea to [a 1998 paper][cmo98] by Cohen, Miyaji, and
Ono). The \\( T \\) coordinate is not required for doublings, so when
doublings are followed by doublings, its computation can be skipped.
More details on this approach and the different coordinate systems can
be found in the [`curve_models` module documentation][curve_models].
Unfortunately, this optimization is not compatible with the parallel
formulas, which cannot save time by skipping a single variable, so the
parallel doubling formulas do slightly more work when counting the
total number of field multiplications and squarings.
In addition, the parallel doubling formulas have a less regular
pattern of additions and subtractions than the parallel addition
formulas, so the vectorization overhead is proportionately greater.
Both the parallel addition and parallel doubling formulas also require
some shuffling to rearrange data within the vectors, which places more
pressure on the shuffle unit than is desirable.
This means that the speedup from using a vectorized implementation of
parallel Edwards formulas is likely to be greatest in applications
that do fewer doublings and more additions (like a large multiscalar
multiplication) rather than applications that do fewer additions and
more doublings (like a double-base scalar multiplication).
Third, Amdahl's law says that the speedup is limited to the portion
which can be parallelized. Normally, the field multiplications
dominate the cost of point operations, but with the IFMA backend, the
multiplications are so fast that the non-parallel additions end up as
a significant portion of the total time.
Fourth, current Intel CPUs perform thermal throttling when using wide
vector instructions. A detailed description can be found in §15.26 of
[the Intel Optimization Manual][intel], but using wide vector
instructions prevents the core from operating at higher frequencies.
The core can return to the higher-frequency state after 2
milliseconds, but this timer is reset every time high-power
instructions are used.
Any speedup from vectorization therefore has to be weighed against a
slowdown for the next few million instructions. For a mixed workload,
where point operations are interspersed with other tasks, this can
reduce overall performance. This implementation is therefore probably
not suitable for basic applications, like signatures, but is
worthwhile for complex applications, like zero-knowledge proofs, which
do sustained work.
# Future work
There are several directions for future improvement:
* Using the vectorized field arithmetic code to parallelize across
point operations rather than within a single point operation. This
is less flexible, but would give a speedup both from allowing use of
the faster mixed-model arithmetic and from reducing shuffle
pressure. One approach in this direction would be to implement
batched scalar-point operations using vectors of points (AoSoA
layout). This is less generally useful, but would give a speedup for
Bulletproofs.
* Extending the IFMA implementation to use the full width of AVX512,
either handling the extra parallelism internally to a single point
operation (by using a 2-way parallel implementation of field
arithmetic instead of a wordsliced one), or externally,
parallelizing across point operations. Internal parallelism would
be preferable but might require too much shuffle pressure. For now,
the only available CPU which runs IFMA operations executes them at
256 bits wide anyway, so this isn't yet important.
* Generalizing the implementation to NEON instructions. The current
point arithmetic code is written in terms of field element vectors,
which are in turn implemented using platform SIMD vectors. It
should be possible to write an alternate implementation of the
`FieldElement2625x4` using NEON without changing the point
arithmetic. NEON has 128-bit vectors rather than 256-bit vectors,
but this may still be worthwhile compared to a serial
implementation.
[sandy2x]: https://eprint.iacr.org/2015/943.pdf
[avx2trac]: https://trac.torproject.org/projects/tor/ticket/8897#comment:28
[hwcd08]: https://www.iacr.org/archive/asiacrypt2008/53500329/53500329.pdf
[curve_models]: https://docs.rs/curve25519-dalek/latest/curve25519-dalek/backend/serial/curve_models/index.html
[bbjlp08]: https://eprint.iacr.org/2008/013
[cmo98]: https://link.springer.com/content/pdf/10.1007%2F3-540-49649-1_6.pdf
[intel]: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf