secp256k1

2025-08-16 10:48:58 -04:00
parent 2036d0165b
commit 77186c88dd
175 changed files with 79997 additions and 1 deletions
--- a/secp256k1/doc/ellswift.md
+++ b/secp256k1/doc/ellswift.md
@@ -0,0 +1,483 @@
+# ElligatorSwift for secp256k1 explained
+
+In this document we explain how the `ellswift` module implementation is related to the
+construction in the
+["SwiftEC: Shallue–van de Woestijne Indifferentiable Function To Elliptic Curves"](https://eprint.iacr.org/2022/759)
+paper by Jorge Chávez-Saab, Francisco Rodríguez-Henríquez, and Mehdi Tibouchi.
+
+* [1. Introduction](#1-introduction)
+* [2. The decoding function](#2-the-decoding-function)
+  + [2.1 Decoding for `secp256k1`](#21-decoding-for-secp256k1)
+* [3. The encoding function](#3-the-encoding-function)
+  + [3.1 Switching to *v, w* coordinates](#31-switching-to-v-w-coordinates)
+  + [3.2 Avoiding computing all inverses](#32-avoiding-computing-all-inverses)
+  + [3.3 Finding the inverse](#33-finding-the-inverse)
+  + [3.4 Dealing with special cases](#34-dealing-with-special-cases)
+  + [3.5 Encoding for `secp256k1`](#35-encoding-for-secp256k1)
+* [4. Encoding and decoding full *(x, y)* coordinates](#4-encoding-and-decoding-full-x-y-coordinates)
+  + [4.1 Full *(x, y)* coordinates for `secp256k1`](#41-full-x-y-coordinates-for-secp256k1)
+
+## 1. Introduction
+
+The `ellswift` module effectively introduces a new 64-byte public key format, with the property
+that (uniformly random) public keys can be encoded as 64-byte arrays which are computationally
+indistinguishable from uniform byte arrays. The module provides functions to convert public keys
+from and to this format, as well as convenience functions for key generation and ECDH that operate
+directly on ellswift-encoded keys.
+
+The encoding consists of the concatenation of two (32-byte big endian) encoded field elements $u$
+and $t.$ Together they encode an x-coordinate $x$ on the curve, or (see further) a full point $(x, y)$ on
+the curve.
+
+**Decoding** consists of decoding the field elements $u$ and $t$ (values at or above the field size $q$
+are taken modulo $q$), and then evaluating $F_u(t)$, which for every $u$ and $t$ results in a valid
+x-coordinate on the curve. The functions $F_u$ will be defined in [Section 2](#2-the-decoding-function).
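
As a minimal sketch of the parsing step described above (an illustration, not the library's C code), the Python below splits a 64-byte encoding into $u$ and $t$ and reduces both modulo the secp256k1 field size. The names `parse_ellswift` and `P` are hypothetical and chosen only for this example.

```python
# Field size of secp256k1, as stated in Section 2 of this document.
P = 2**256 - 2**32 - 977

def parse_ellswift(enc64):
    """Split a 64-byte ellswift encoding into the field elements (u, t).

    Each half is read as a 32-byte big-endian integer; values at or above
    the field size are reduced modulo P.
    """
    if len(enc64) != 64:
        raise ValueError("an ellswift encoding is exactly 64 bytes")
    u = int.from_bytes(enc64[:32], "big") % P
    t = int.from_bytes(enc64[32:], "big") % P
    return u, t
```

Because of the reduction step, a small number of distinct byte strings parse to the same $(u, t)$ pair; every 64-byte input is accepted.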
+
+**Encoding** a given $x$ coordinate is conceptually done as follows:
+* Loop:
+  * Pick a uniformly random field element $u.$
+  * Compute the set $L = F_u^{-1}(x)$ of $t$ values for which $F_u(t) = x$, which may have up to *8* elements.
+  * With probability $1 - \dfrac{\\#L}{8}$, restart the loop.
+  * Select a uniformly random $t \in L$ and return $(u, t).$
+
+This is the *ElligatorSwift* algorithm, here given for just x-coordinates. An extension to full
+$(x, y)$ points will be given in [Section 4](#4-encoding-and-decoding-full-x-y-coordinates).
+The algorithm finds a uniformly random $(u, t)$ among (almost all) those
+for which $F_u(t) = x.$ Section 3.2 in the paper proves that the number of such encodings for
+almost all x-coordinates on the curve (all but at most 39) is close to two times the field size
+(specifically, it lies in the range $2q \pm (22\sqrt{q} + O(1))$, where $q$ is the size of the field).
+
+## 2. The decoding function
+
+First some definitions:
+* $\mathbb{F}$ is the finite field of size $q$, of characteristic 5 or more, and $q \equiv 1 \mod 3.$
+  * For `secp256k1`, $q = 2^{256} - 2^{32} - 977$, which satisfies that requirement.
+* Let $E$ be the elliptic curve of points $(x, y) \in \mathbb{F}^2$ for which $y^2 = x^3 + ax + b$, with $a$ and $b$
+  public constants, for which $\Delta_E = -16(4a^3 + 27b^2)$ is a square, and at least one of $(-b \pm \sqrt{-3 \Delta_E} / 36)/2$ is a square.
+  This implies that the order of $E$ is either odd, or a multiple of *4*.
+  If $a=0$, this condition is always fulfilled.
+  * For `secp256k1`, $a=0$ and $b=7.$
+* Let the function $g(x) = x^3 + ax + b$, so the $E$ curve equation is also $y^2 = g(x).$
+* Let the function $h(x) = 3x^2 + 4a.$
+* Define $V$ as the set of solutions $(x_1, x_2, x_3, z)$ to $z^2 = g(x_1)g(x_2)g(x_3).$
+* Define $S_u$ as the set of solutions $(X, Y)$ to $X^2 + h(u)Y^2 = -g(u)$ and $Y \neq 0.$
+* $P_u$ is a function from $\mathbb{F}$ to $S_u$ that will be defined below.
+* $\psi_u$ is a function from $S_u$ to $V$ that will be defined below.
+
+**Note**: In the paper:
+* $F_u$ corresponds to $F_{0,u}$ there.
+* $P_u(t)$ is called $P$ there.
+* All $S_u$ sets together correspond to $S$ there.
+* All $\psi_u$ functions together (operating on elements of $S$) correspond to $\psi$ there.
+
+Note that for $V$, the left-hand side of the equation, $z^2$, is square, and thus the right-hand
+side must also be square. As the product of two non-squares is a square in $\mathbb{F}$,
+an even number of the three right-hand side factors must be non-squares.
+This implies that exactly *1* or exactly *3* out of
+$\\{g(x_1), g(x_2), g(x_3)\\}$ must be square, and thus that for any $(x_1,x_2,x_3,z) \in V$,
+at least one of $\\{x_1, x_2, x_3\\}$ must be a valid x-coordinate on $E.$ There is one exception
+to this, namely when $z=0$, but even then one of the three values is a valid x-coordinate.
+
+**Define** the decoding function $F_u(t)$ as:
+* Let $(x_1, x_2, x_3, z) = \psi_u(P_u(t)).$
+* Return the first element $x$ of $(x_3, x_2, x_1)$ which is a valid x-coordinate on $E$ (i.e., $g(x)$ is square).
+
+$P_u(t) = (X(u, t), Y(u, t))$, where:
+
+$$
+\begin{array}{lcl}
+X(u, t) & = & \left\\{\begin{array}{ll}
+  \dfrac{g(u) - t^2}{2t} & a = 0 \\
+  \dfrac{g(u) + h(u)(Y_0(u) - X_0(u)t)^2}{X_0(u)(1 + h(u)t^2)} & a \neq 0
+\end{array}\right. \\
+Y(u, t) & = & \left\\{\begin{array}{ll}
+  \dfrac{X(u, t) + t}{u \sqrt{-3}} = \dfrac{g(u) + t^2}{2tu\sqrt{-3}} & a = 0 \\
+  Y_0(u) + t(X(u, t) - X_0(u)) & a \neq 0
+\end{array}\right.
+\end{array}
+$$
+
+$P_u(t)$ is defined:
+* For $a=0$, unless:
+  * $u = 0$ or $t = 0$ (division by zero)
+  * $g(u) = -t^2$ (would give $Y=0$).
+* For $a \neq 0$, unless:
+  * $X_0(u) = 0$ or $h(u)t^2 = -1$ (division by zero)
+  * $Y_0(u) (1 - h(u)t^2) = 2X_0(u)t$ (would give $Y=0$).
+
+The functions $X_0(u)$ and $Y_0(u)$ are defined in Appendix A of the paper, and depend on various properties of $E.$
+
+The function $\psi_u$ is the same for all curves: $\psi_u(X, Y) = (x_1, x_2, x_3, z)$, where:
+
+$$
+\begin{array}{lcl}
+  x_1 & = & \dfrac{X}{2Y} - \dfrac{u}{2} && \\
+  x_2 & = & -\dfrac{X}{2Y} - \dfrac{u}{2} && \\
+  x_3 & = & u + 4Y^2 && \\
+  z   & = & \dfrac{g(x_3)}{2Y}(u^2 + ux_1 + x_1^2 + a) = \dfrac{-g(u)g(x_3)}{8Y^3}
+\end{array}
+$$
+
+### 2.1 Decoding for `secp256k1`
+
+Put together and specialized for $a=0$ curves, decoding $(u, t)$ to an x-coordinate is:
+
+**Define** $F_u(t)$ as:
+* Let $X = \dfrac{u^3 + b - t^2}{2t}.$
+* Let $Y = \dfrac{X + t}{u\sqrt{-3}}.$
+* Return the first $x$ in $(u + 4Y^2, \dfrac{-X}{2Y} - \dfrac{u}{2}, \dfrac{X}{2Y} - \dfrac{u}{2})$ for which $g(x)$ is square.
+
+To make sure that every input decodes to a valid x-coordinate, we remap the inputs in case
+$P_u$ is not defined (when $u=0$, $t=0$, or $g(u) = -t^2$):
+
+**Define** $F_u(t)$ as:
+* Let $u'=u$ if $u \neq 0$; $1$ otherwise (guaranteeing $u' \neq 0$).
+* Let $t'=t$ if $t \neq 0$; $1$ otherwise (guaranteeing $t' \neq 0$).
+* Let $t''=t'$ if $g(u') \neq -t'^2$; $2t'$ otherwise (guaranteeing $t'' \neq 0$ and $g(u') \neq -t''^2$).
+* Let $X = \dfrac{u'^3 + b - t''^2}{2t''}.$
+* Let $Y = \dfrac{X + t''}{u'\sqrt{-3}}.$
+* Return the first $x$ in $(u' + 4Y^2, \dfrac{-X}{2Y} - \dfrac{u'}{2}, \dfrac{X}{2Y} - \dfrac{u'}{2})$ for which $x^3 + b$ is square.
+
+The choices here are not strictly necessary. Just returning a fixed constant in any of the undefined cases would suffice,
+but the approach here is simple enough and gives fairly uniform output even in these cases.
+
+**Note**: in the paper these conditions result in $\infty$ as output, due to the use of projective coordinates there.
+We wish to avoid the need for callers to deal with this special case.
+
+This is implemented in `secp256k1_ellswift_xswiftec_frac_var` (which decodes to an x-coordinate represented as a fraction), and
+in `secp256k1_ellswift_xswiftec_var` (which outputs the actual x-coordinate).
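
The remapped decoder defined above can be sketched in Python as follows. This is an illustration under stated assumptions, not the library's implementation: the helper names (`g`, `is_square`, `inv`, `xswiftec`) are hypothetical, field elements are plain integers, and the square root of $-3$ is computed on the fly, so it is not guaranteed to be the same root the C code uses.

```python
# secp256k1 parameters as stated in Section 2: field size and curve constant b.
P = 2**256 - 2**32 - 977
B = 7

def g(x):
    """Right-hand side of the curve equation y^2 = x^3 + b."""
    return (pow(x, 3, P) + B) % P

def is_square(a):
    """Euler's criterion; 0 counts as a square."""
    a %= P
    return a == 0 or pow(a, (P - 1) // 2, P) == 1

def inv(a):
    """Modular inverse via Fermat's little theorem (a must be nonzero mod P)."""
    return pow(a, P - 2, P)

# A square root of -3. Since P = 3 mod 4, a root of a square a is a^((P+1)/4).
# Assumption: which of the two roots the library uses is not checked here.
SQRT_M3 = pow(P - 3, (P + 1) // 4, P)
assert SQRT_M3 * SQRT_M3 % P == P - 3

def xswiftec(u, t):
    """Decode (u, t) to an x-coordinate, following the remapped F_u(t)."""
    u %= P
    t %= P
    if u == 0:                       # u' = 1 when u = 0
        u = 1
    if t == 0:                       # t' = 1 when t = 0
        t = 1
    if (g(u) + t * t) % P == 0:      # t'' = 2t' when g(u') = -t'^2
        t = 2 * t % P
    X = (g(u) - t * t) * inv(2 * t) % P
    Y = (X + t) * inv(u * SQRT_M3) % P
    half_u = u * inv(2) % P
    x_over_2y = X * inv(2 * Y) % P
    # Candidates in the precedence order (x3, x2, x1).
    candidates = (
        (u + 4 * Y * Y) % P,
        (-x_over_2y - half_u) % P,
        (x_over_2y - half_u) % P,
    )
    for x in candidates:
        if is_square(g(x)):
            return x
    raise AssertionError("unreachable: one candidate is always on the curve")
```

For example, `xswiftec(0, 7)` takes the same path as `xswiftec(1, 7)` because of the $u' = 1$ remapping, so the two calls return the same x-coordinate.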
+
+## 3. The encoding function
+
+To implement $F_u^{-1}(x)$, the function to find the set of inverses $t$ for which $F_u(t) = x$, we have to reverse the process:
+* Find all the $(X, Y) \in S_u$ that could have given rise to $x$, through the $x_1$, $x_2$, or $x_3$ formulas in $\psi_u.$
+* Map those $(X, Y)$ solutions to $t$ values using $P_u^{-1}(X, Y).$
+* For each of the found $t$ values, verify that $F_u(t) = x.$
+* Return the remaining $t$ values.
+
+The function $P_u^{-1}$, which finds $t$ given $(X, Y) \in S_u$, is significantly simpler than $P_u:$
+
+$$
+P_u^{-1}(X, Y) = \left\\{\begin{array}{ll}
+Yu\sqrt{-3} - X & a = 0 \\
+\dfrac{Y-Y_0(u)}{X-X_0(u)} & a \neq 0 \land X \neq X_0(u) \\
+\dfrac{-X_0(u)}{h(u)Y_0(u)} & a \neq 0 \land X = X_0(u) \land Y = Y_0(u)
+\end{array}\right.
+$$
+
+The third step above, verifying that $F_u(t) = x$, is necessary because for the $(X, Y)$ values found through the $x_1$ and $x_2$ expressions,
+it is possible that decoding through $\psi_u(X, Y)$ yields a valid $x_3$ on the curve, which would take precedence over the
+$x_1$ or $x_2$ decoding. These $(X, Y)$ solutions must be rejected.
+
+Since we know that exactly one or exactly three out of $\\{x_1, x_2, x_3\\}$ are valid x-coordinates for any $t$,
+the case where either $x_1$ or $x_2$ is valid and in addition also $x_3$ is valid must mean that all three are valid.
+This means that instead of checking whether $x_3$ is on the curve, it is also possible to check whether the other one out of
+$x_1$ and $x_2$ is on the curve. This is significantly simpler, as it turns out.
+
+Observe that $\psi_u$ guarantees that $x_1 + x_2 = -u.$ So given either $x = x_1$ or $x = x_2$, the other one of the two can be computed as
+$-u - x.$ Thus, when encoding $x$ through the $x_1$ or $x_2$ expressions, one can simply check whether $g(-u-x)$ is a square,
+and if so, not include the corresponding $t$ values in the returned set. As this does not need $X$, $Y$, or $t$, this condition can be determined
+before those values are computed.
+
+It is not possible that an encoding found through the $x_1$ expression decodes to a different valid x-coordinate using $x_2$ (which would
+take precedence), for the same reason: if both $x_1$ and $x_2$ decodings were valid, $x_3$ would be valid as well, and thus take
+precedence over both. Because of this, the $g(-u-x)$ being square test for $x_1$ and $x_2$ is the only test necessary to guarantee the found $t$
+values round-trip back to the input $x$ correctly. This is the reason for choosing the $(x_3, x_2, x_1)$ precedence order in the decoder;
+any order which does not place $x_3$ first requires more complicated round-trip checks in the encoder.
+
+### 3.1 Switching to *v, w* coordinates
+
+Before working out the formulas for all this, we switch to different variables for $S_u.$ Let $v = (X/Y - u)/2$, and
+$w = 2Y.$ Or in the other direction, $X = w(u/2 + v)$ and $Y = w/2:$
+* $S_u'$ becomes the set of $(v, w)$ for which $w^2 (u^2 + uv + v^2 + a) = -g(u)$ and $w \neq 0.$
+* For $a=0$ curves, $P_u^{-1}$ can be stated for $(v,w)$ as $P_u^{'-1}(v, w) = w\left(\frac{\sqrt{-3}-1}{2}u - v\right).$
+* $\psi_u$ can be stated for $(v, w)$ as $\psi_u'(v, w) = (x_1, x_2, x_3, z)$, where
+
+$$
+\begin{array}{lcl}
+  x_1 & = & v \\
+  x_2 & = & -u - v \\
+  x_3 & = & u + w^2 \\
+  z   & = & \dfrac{g(x_3)}{w}(u^2 + uv + v^2 + a) = \dfrac{-g(u)g(x_3)}{w^3}
+\end{array}
+$$
+
+We can now write the expressions for finding $(v, w)$ given $x$ explicitly, by solving each of the $\\{x_1, x_2, x_3\\}$
+expressions for $v$ or $w$, and using the $S_u'$ equation to find the other variable:
+* Assuming $x = x_1$, we find $v = x$ and $w = \pm\sqrt{-g(u)/(u^2 + uv + v^2 + a)}$ (two solutions).
+* Assuming $x = x_2$, we find $v = -u-x$ and $w = \pm\sqrt{-g(u)/(u^2 + uv + v^2 + a)}$ (two solutions).
+* Assuming $x = x_3$, we find $w = \pm\sqrt{x-u}$ and $v = -u/2 \pm \sqrt{-w^2(4g(u) + w^2h(u))}/(2w^2)$ (four solutions).
+
+### 3.2 Avoiding computing all inverses
+
+The *ElligatorSwift* algorithm as stated in Section 1 requires the computation of $L = F_u^{-1}(x)$ (the
+set of all $t$ such that $(u, t)$ decode to $x$) in full. This is unnecessary.
+
+Observe that the procedure of restarting with probability $(1 - \frac{\\#L}{8})$ and otherwise returning a
+uniformly random element from $L$ is actually equivalent to always padding $L$ with $\bot$ values up to length 8,
+picking a uniformly random element from that, restarting whenever $\bot$ is picked:
+
+**Define** *ElligatorSwift(x)* as:
+* Loop:
+  * Pick a uniformly random field element $u.$
+  * Compute the set $L = F_u^{-1}(x).$
+  * Let $T$ be the 8-element vector consisting of the elements of $L$, plus $8 - \\#L$ times $\\{\bot\\}.$
+  * Select a uniformly random $t \in T.$
+  * If $t \neq \bot$, return $(u, t)$; restart loop otherwise.
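
To see the equivalence concretely, consider one loop iteration with a fixed $u$ and any particular $t \in L$ (so $L$ is non-empty). In the first formulation the iteration continues with probability $\\#L/8$ and then picks $t$ with probability $1/\\#L$; in the padded formulation $t$ occupies one of the 8 slots of $T$. Both give:

$$
\begin{array}{lcl}
\Pr[\text{return } t] & = & \dfrac{\\#L}{8} \cdot \dfrac{1}{\\#L} = \dfrac{1}{8} \\
\Pr[\text{restart}]   & = & 1 - \dfrac{\\#L}{8} = \dfrac{8 - \\#L}{8}
\end{array}
$$

When $L$ is empty, both formulations restart with probability 1.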
+
+Now notice that the order of elements in $T$ does not matter, as all we do is pick a uniformly
+random element in it, so we do not need to have all $\bot$ values at the end.
+As we have 8 distinct formulas for finding $(v, w)$ (taking the variants due to $\pm$ into account),
+we can associate every index in $T$ with exactly one of those formulas, making sure that:
+* Formulas that yield no solutions (due to division by zero or non-existing square roots) or invalid solutions are made to return $\bot.$
+* For the $x_1$ and $x_2$ cases, if $g(-u-x)$ is a square, $\bot$ is returned instead (the round-trip check).
+* In case multiple formulas would return the same non- $\bot$ result, all but one of those must be turned into $\bot$ to avoid biasing those.
+
+The last condition above only occurs with negligible probability for cryptographically-sized curves, but is interesting
+to take into account as it allows exhaustive testing in small groups. See [Section 3.4](#34-dealing-with-special-cases)
+for an analysis of all the negligible cases.
+
+If we define $T = (G_{0,u}(x), G_{1,u}(x), \ldots, G_{7,u}(x))$, with each $G_{i,u}$ matching one of the formulas,
+the loop can be simplified to only compute one of the inverses instead of all of them:
+
+**Define** *ElligatorSwift(x)* as:
+* Loop:
+  * Pick a uniformly random field element $u.$
+  * Pick a uniformly random integer $c$ in $[0,8).$
+  * Let $t = G_{c,u}(x).$
+  * If $t \neq \bot$, return $(u, t)$; restart loop otherwise.
+
+This is implemented in `secp256k1_ellswift_xelligatorswift_var`.
+
+### 3.3 Finding the inverse
+
+To implement $G_{c,u}$, we map $c=0$ to the $x_1$ formula, $c=1$ to the $x_2$ formula, and $c=2$ and $c=3$ to the $x_3$ formula.
+Those are then repeated as $c=4$ through $c=7$ for the other sign of $w$ (noting that in each formula, $w$ is a square root of some expression).
+Ignoring the negligible cases, we get:
+
+**Define** $G_{c,u}(x)$ as:
+* If $c \in \\{0, 1, 4, 5\\}$ (for $x_1$ and $x_2$ formulas):
+  * If $g(-u-x)$ is square, return $\bot$ (as $x_3$ would be valid and take precedence).
+  * If $c \in \\{0, 4\\}$ (the $x_1$ formula) let $v = x$, otherwise let $v = -u-x$ (the $x_2$ formula)
+  * Let $s = -g(u)/(u^2 + uv + v^2 + a)$ (using $s = w^2$ in what follows).
+* Otherwise, when $c \in \\{2, 3, 6, 7\\}$ (for $x_3$ formulas):
+  * Let $s = x-u.$
+  * Let $r = \sqrt{-s(4g(u) + sh(u))}.$
+  * Let $v = (r/s - u)/2$ if $c \in \\{3, 7\\}$; $(-r/s - u)/2$ otherwise.
+* Let $w = \sqrt{s}.$
+* Depending on $c:$
+  * If $c \in \\{0, 1, 2, 3\\}:$ return $P_u^{'-1}(v, w).$
+  * If $c \in \\{4, 5, 6, 7\\}:$ return $P_u^{'-1}(v, -w).$
+
+Whenever a square root of a non-square is taken, $\bot$ is returned; for both square roots this happens with roughly
+50% probability on random inputs. Similarly, when a division by 0 would occur, $\bot$ is returned as well; this will only happen
+with negligible probability. A division by 0 in the first branch in fact cannot occur at all, because $u^2 + uv + v^2 + a = 0$
+implies $g(-u-x) = g(x)$ which would mean the $g(-u-x)$ is square condition has triggered
+and $\bot$ would have been returned already.
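
The implication used in the paragraph above follows from a factorization that holds for any $u$ and $x$ (a short derivation, spelled out here for convenience):

$$
\begin{array}{lcl}
g(x) - g(-u-x) & = & x^3 + (u+x)^3 + a(u + 2x) \\
               & = & (u + 2x)(u^2 + ux + x^2) + a(u + 2x) \\
               & = & (u + 2x)(u^2 + ux + x^2 + a)
\end{array}
$$

So when $u^2 + ux + x^2 + a = 0$, we get $g(-u-x) = g(x)$, which is square because $x$ is a valid x-coordinate. The factor $u^2 + ux + x^2 + a$ is unchanged when $x$ is replaced by $-u-x$, so the same argument covers both $v = x$ and $v = -u-x$.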
+
+**Note**: In the paper, the $case$ variable corresponds roughly to the $c$ above, but only takes on 4 possible values (1 to 4).
+The conditional negation of $w$ at the end is done randomly, which is equivalent, but makes testing harder. We choose to
+have the $G_{c,u}$ be deterministic, and capture all choices in $c.$
+
+Now observe that the $c \in \\{1, 5\\}$ and $c \in \\{3, 7\\}$ conditions effectively perform the same $v \rightarrow -u-v$
+transformation. Furthermore, that transformation has no effect on $s$ in the first branch
+as $u^2 + ux + x^2 + a = u^2 + u(-u-x) + (-u-x)^2 + a.$ Thus we can extract it out and move it down:
+
+**Define** $G_{c,u}(x)$ as:
+* If $c \in \\{0, 1, 4, 5\\}:$
+  * If $g(-u-x)$ is square, return $\bot.$
+  * Let $s = -g(u)/(u^2 + ux + x^2 + a).$
+  * Let $v = x.$
+* Otherwise, when $c \in \\{2, 3, 6, 7\\}:$
+  * Let $s = x-u.$
+  * Let $r = \sqrt{-s(4g(u) + sh(u))}.$
+  * Let $v = (r/s - u)/2.$
+* Let $w = \sqrt{s}.$
+* Depending on $c:$
+  * If $c \in \\{0, 2\\}:$ return $P_u^{'-1}(v, w).$
+  * If $c \in \\{1, 3\\}:$ return $P_u^{'-1}(-u-v, w).$
+  * If $c \in \\{4, 6\\}:$ return $P_u^{'-1}(v, -w).$
+  * If $c \in \\{5, 7\\}:$ return $P_u^{'-1}(-u-v, -w).$
+
+This shows there will always be exactly 0, 4, or 8 $t$ values for a given $(u, x)$ input.
+There can be 0, 1, or 2 $(v, w)$ pairs before invoking $P_u^{'-1}$, and each results in 4 distinct $t$ values.
+
+### 3.4 Dealing with special cases
+
+As mentioned before there are a few cases to deal with which only happen in a negligibly small subset of inputs.
+For cryptographically sized fields, if only random inputs are going to be considered, it is unnecessary to deal with these. Still, for completeness
+we analyse them here. They generally fall into two categories: cases in which the encoder would produce $t$ values that
+do not decode back to $x$ (or at least cannot guarantee that they do), and cases in which the encoder might produce the same
+$t$ value for multiple $c$ inputs (thereby biasing that encoding):
+
+* In the branch for $x_1$ and $x_2$ (where $c \in \\{0, 1, 4, 5\\}$):
+  * When $g(u) = 0$, we would have $s=w=Y=0$, which is not on $S_u.$ This is only possible on even-ordered curves.
+    Excluding this also removes the one condition under which the simplified check for $x_3$ on the curve
+    fails (namely when $g(x_1)=g(x_2)=0$ but $g(x_3)$ is not square).
+    This does exclude some valid encodings: when both $g(u)=0$ and $u^2+ux+x^2+a=0$ (also implying $g(x)=0$),
+    the $S_u'$ equation degenerates to $0 = 0$, and many valid $t$ values may exist. Yet, these cannot be targeted uniformly by the
+    encoder anyway as there will generally be more than 8.
+  * When $g(x) = 0$, the same $t$ would be produced as in the $x_3$ branch (where $c \in \\{2, 3, 6, 7\\}$) which we give precedence
+    as it can deal with $g(u)=0$.
+    This is again only possible on even-ordered curves.
+* In the branch for $x_3$ (where $c \in \\{2, 3, 6, 7\\}$):
+  * When $s=0$, a division by zero would occur.
+  * When $v = -u-v$ and $c \in \\{3, 7\\}$, the same $t$ would be returned as in the $c \in \\{2, 6\\}$ cases.
+    It is equivalent to checking whether $r=0$.
+    This cannot occur in the $x_1$ or $x_2$ branches, as it would trigger the $g(-u-x)$ is square condition.
+    A similar concern for $w = -w$ does not exist, as $w=0$ is already impossible in both branches: in the first
+    it requires $g(u)=0$ which is already outlawed on even-ordered curves and impossible on others; in the second it would trigger division by zero.
+* Curve-specific special cases also exist that need to be rejected, because they result in $(u,t)$ which is invalid to the decoder, or because of division by zero in the encoder:
+  * For $a=0$ curves, when $u=0$ or when $t=0$. The latter can only be reached by the encoder when $g(u)=0$, which requires an even-ordered curve.
+  * For $a \neq 0$ curves, when $X_0(u)=0$, when $h(u)t^2 = -1$, or when $w(u + 2v) = 2X_0(u)$ while also either $w \neq 2Y_0(u)$ or $h(u)=0$.
+
+**Define** a version of $G_{c,u}(x)$ which deals with all these cases:
+* If $a=0$ and $u=0$, return $\bot.$
+* If $a \neq 0$ and $X_0(u)=0$, return $\bot.$
+* If $c \in \\{0, 1, 4, 5\\}:$
+  * If $g(u) = 0$ or $g(x) = 0$, return $\bot$ (even curves only).
+  * If $g(-u-x)$ is square, return $\bot.$
+  * Let $s = -g(u)/(u^2 + ux + x^2 + a)$ (cannot cause division by zero).
+  * Let $v = x.$
+* Otherwise, when $c \in \\{2, 3, 6, 7\\}:$
+  * Let $s = x-u.$
+  * Let $r = \sqrt{-s(4g(u) + sh(u))}$; return $\bot$ if not square.
+  * If $c \in \\{3, 7\\}$ and $r=0$, return $\bot.$
+  * If $s = 0$, return $\bot.$
+  * Let $v = (r/s - u)/2.$
+* Let $w = \sqrt{s}$; return $\bot$ if not square.
+* If $a \neq 0$ and $w(u+2v) = 2X_0(u)$ and either $w \neq 2Y_0(u)$ or $h(u) = 0$, return $\bot.$
+* Depending on $c:$
+  * If $c \in \\{0, 2\\}$, let $t = P_u^{'-1}(v, w).$
+  * If $c \in \\{1, 3\\}$, let $t = P_u^{'-1}(-u-v, w).$
+  * If $c \in \\{4, 6\\}$, let $t = P_u^{'-1}(v, -w).$
+  * If $c \in \\{5, 7\\}$, let $t = P_u^{'-1}(-u-v, -w).$
+* If $a=0$ and $t=0$, return $\bot$ (even curves only).
+* If $a \neq 0$ and $h(u)t^2 = -1$, return $\bot.$
+* Return $t.$
+
+Given any $u$, using this algorithm over all $x$ and $c$ values, every $t$ value will be reached exactly once,
+for an $x$ for which $F_u(t) = x$ holds, except for these cases that will not be reached:
+* All cases where $P_u(t)$ is not defined:
+  * For $a=0$ curves, when $u=0$, $t=0$, or $g(u) = -t^2.$
+  * For $a \neq 0$ curves, when $h(u)t^2 = -1$, $X_0(u) = 0$, or $Y_0(u) (1 - h(u) t^2) = 2X_0(u)t.$
+* When $g(u)=0$, the potentially many $t$ values that decode to an $x$ satisfying $g(x)=0$ using the $x_2$ formula. These were excluded by the $g(u)=0$ condition in the $c \in \\{0, 1, 4, 5\\}$ branch.
+
+These cases form a negligible subset of all $(u, t)$ for cryptographically sized curves.
+
+### 3.5 Encoding for `secp256k1`
+
+Specialized for odd-ordered $a=0$ curves:
+
+**Define** $G_{c,u}(x)$ as:
+* If $u=0$, return $\bot.$
+* If $c \in \\{0, 1, 4, 5\\}:$
+  * If $(-u-x)^3 + b$ is square, return $\bot.$
+  * Let $s = -(u^3 + b)/(u^2 + ux + x^2)$ (cannot cause division by 0).
+  * Let $v = x.$
+* Otherwise, when $c \in \\{2, 3, 6, 7\\}:$
+  * Let $s = x-u.$
+  * Let $r = \sqrt{-s(4(u^3 + b) + 3su^2)}$; return $\bot$ if not square.
+  * If $c \in \\{3, 7\\}$ and $r=0$, return $\bot.$
+  * If $s = 0$, return $\bot.$
+  * Let $v = (r/s - u)/2.$
+* Let $w = \sqrt{s}$; return $\bot$ if not square.
+* Depending on $c:$
+  * If $c \in \\{0, 2\\}:$ return $w(\frac{\sqrt{-3}-1}{2}u - v).$
+  * If $c \in \\{1, 3\\}:$ return $w(\frac{\sqrt{-3}+1}{2}u + v).$
+  * If $c \in \\{4, 6\\}:$ return $w(\frac{-\sqrt{-3}+1}{2}u + v).$
+  * If $c \in \\{5, 7\\}:$ return $w(\frac{-\sqrt{-3}-1}{2}u - v).$
+
+This is implemented in `secp256k1_ellswift_xswiftec_inv_var`.
+
+And the x-only ElligatorSwift encoding algorithm is still:
+
+**Define** *ElligatorSwift(x)* as:
+* Loop:
+  * Pick a uniformly random field element $u.$
+  * Pick a uniformly random integer $c$ in $[0,8).$
+  * Let $t = G_{c,u}(x).$
+  * If $t \neq \bot$, return $(u, t)$; restart loop otherwise.
+
+Note that this logic does not take the remapped $u=0$, $t=0$, and $g(u) = -t^2$ cases into account; it just avoids them.
+While it is not impossible to make the encoder target them, this would increase the maximum number of $t$ values for a given $(u, x)$
+combination beyond 8, and thereby slow down the ElligatorSwift loop proportionally, for a negligible gain in uniformity.
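
The secp256k1 encoder and the ElligatorSwift loop above can be sketched in Python as follows, together with the Section 2.1 decoder so that the round trip can be checked. This is an illustration under stated assumptions rather than the library's code: all function names are hypothetical, `None` stands in for $\bot$, the input `x` is assumed to be a valid x-coordinate, and the chosen root of $-3$ is not guaranteed to match the one in the C implementation.

```python
import secrets

P = 2**256 - 2**32 - 977   # secp256k1 field size
B = 7                      # secp256k1 curve constant b

def g(x):
    return (pow(x, 3, P) + B) % P

def is_square(a):
    a %= P
    return a == 0 or pow(a, (P - 1) // 2, P) == 1

def sqrt(a):
    """Return a square root of a, or None if a is not a square (P = 3 mod 4)."""
    a %= P
    r = pow(a, (P + 1) // 4, P)
    return r if r * r % P == a else None

def inv(a):
    return pow(a, P - 2, P)

SQRT_M3 = sqrt(-3)   # assumption: either root works for self-consistency
INV2 = inv(2)

def xswiftec(u, t):
    """Decoder from Section 2.1 (with remapping), used for the round-trip check."""
    u = u % P or 1
    t = t % P or 1
    if (g(u) + t * t) % P == 0:
        t = 2 * t % P
    X = (g(u) - t * t) * inv(2 * t) % P
    Y = (X + t) * inv(u * SQRT_M3) % P
    q = X * inv(2 * Y) % P
    for x in ((u + 4 * Y * Y) % P, (-q - u * INV2) % P, (q - u * INV2) % P):
        if is_square(g(x)):
            return x
    raise AssertionError("unreachable")

def xswiftec_inv(x, u, c):
    """G_{c,u}(x) for odd-ordered a=0 curves; None means bottom."""
    x %= P
    u %= P
    if u == 0:
        return None
    if c in (0, 1, 4, 5):                 # x1 and x2 formulas
        if is_square(g(-u - x)):          # round-trip check
            return None
        s = -g(u) * inv(u * u + u * x + x * x) % P
        v = x
    else:                                 # x3 formulas, c in {2, 3, 6, 7}
        s = (x - u) % P
        r = sqrt(-s * (4 * g(u) + 3 * s * u * u))
        if r is None:
            return None
        if c in (3, 7) and r == 0:
            return None
        if s == 0:
            return None
        v = (r * inv(s) - u) * INV2 % P
    w = sqrt(s)
    if w is None:
        return None
    if c in (0, 2):
        return w * ((SQRT_M3 - 1) * INV2 * u - v) % P
    if c in (1, 3):
        return w * ((SQRT_M3 + 1) * INV2 * u + v) % P
    if c in (4, 6):
        return w * ((-SQRT_M3 + 1) * INV2 * u + v) % P
    return w * ((-SQRT_M3 - 1) * INV2 * u - v) % P

def xelligatorswift(x):
    """ElligatorSwift loop: draw random u and c until G_{c,u}(x) is not bottom."""
    while True:
        u = secrets.randbelow(P)
        c = secrets.randbelow(8)
        t = xswiftec_inv(x, u, c)
        if t is not None:
            return u, t
```

A usage sketch: take any decoder output as a valid x-coordinate, for instance `x = xswiftec(3, 4)`, call `u, t = xelligatorswift(x)`, and confirm that `xswiftec(u, t) == x`. Each call to `xswiftec_inv` computes only one of the eight candidate inverses, which is the point of the simplification in Section 3.2.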
+
+## 4. Encoding and decoding full *(x, y)* coordinates
+
+So far we have only addressed encoding and decoding x-coordinates, but in some cases an encoding
+for full points with $(x, y)$ coordinates is desirable. It is possible to encode this information
+in $t$ as well.
+
+Note that for any $(X, Y) \in S_u$, $(\pm X, \pm Y)$ are all on $S_u.$ Moreover, all of these are
+mapped to the same x-coordinate. Negating $X$ or negating $Y$ just results in $x_1$ and $x_2$
+being swapped, and does not affect $x_3.$ This will not change the outcome x-coordinate as the order
+of $x_1$ and $x_2$ only matters if both were to be valid, and in that case $x_3$ would be used instead.
+
+Still, these four $(X, Y)$ combinations all correspond to distinct $t$ values, so we can encode
+the sign of the y-coordinate in the sign of $X$ or the sign of $Y.$ They correspond to the
+four distinct $P_u^{'-1}$ calls in the definition of $G_{c,u}.$
+
+**Note**: In the paper, the sign of the y coordinate is encoded in a separately-coded bit.
+
+To encode the sign of $y$ in the sign of $Y:$
+
+**Define** *Decode(u, t)* for full $(x, y)$ as:
+* Let $(X, Y) = P_u(t).$
+* Let $x$ be the first value in $(u + 4Y^2, \frac{-X}{2Y} - \frac{u}{2}, \frac{X}{2Y} - \frac{u}{2})$ for which $g(x)$ is square.
+* Let $y = \sqrt{g(x)}.$
+* If $sign(y) = sign(Y)$, return $(x, y)$; otherwise return $(x, -y).$
+
+And encoding would be done using a $G_{c,u}(x, y)$ function defined as:
+
+**Define** $G_{c,u}(x, y)$ as:
+* If $c \in \\{0, 1\\}:$
+  * If $g(u) = 0$ or $g(x) = 0$, return $\bot$ (even curves only).
+  * If $g(-u-x)$ is square, return $\bot.$
+  * Let $s = -g(u)/(u^2 + ux + x^2 + a)$ (cannot cause division by zero).
+  * Let $v = x.$
+* Otherwise, when $c \in \\{2, 3\\}:$
+  * Let $s = x-u.$
+  * Let $r = \sqrt{-s(4g(u) + sh(u))}$; return $\bot$ if not square.
+  * If $c = 3$ and $r = 0$, return $\bot.$
+  * Let $v = (r/s - u)/2.$
+* Let $w = \sqrt{s}$; return $\bot$ if not square.
+* Let $w' = w$ if $sign(w/2) = sign(y)$; $-w$ otherwise.
+* Depending on $c:$
+  * If $c \in \\{0, 2\\}:$ return $P_u^{'-1}(v, w').$
+  * If $c \in \\{1, 3\\}:$ return $P_u^{'-1}(-u-v, w').$
+
+Note that $c$ now only ranges $[0,4)$, as the sign of $w'$ is decided based on that of $y$, rather than on $c.$
+This change makes some valid encodings unreachable: when $y = 0$ and $sign(Y) \neq sign(0)$.
+
+In the above logic, $sign$ can be implemented in several ways, such as parity of the integer representation
+of the input field element (for prime-sized fields) or the quadratic residuosity (for fields where
+$-1$ is not square). The choice does not matter, as long as it only takes on two possible values, and for $x \neq 0$ it holds that $sign(x) \neq sign(-x)$.
+
+### 4.1 Full *(x, y)* coordinates for `secp256k1`
+
+For $a=0$ curves, there is another option. Note that for those,
+the $P_u(t)$ function translates negations of $t$ to negations of (both) $X$ and $Y.$ Thus, we can use $sign(t)$ to
+encode the y-coordinate directly. Combined with the earlier remapping to guarantee all inputs land on the curve, we get
+as decoder:
+
+**Define** *Decode(u, t)* as:
+* Let $u'=u$ if $u \neq 0$; $1$ otherwise.
+* Let $t'=t$ if $t \neq 0$; $1$ otherwise.
+* Let $t''=t'$ if $u'^3 + b + t'^2 \neq 0$; $2t'$ otherwise.
+* Let $X = \dfrac{u'^3 + b - t''^2}{2t''}.$
+* Let $Y = \dfrac{X + t''}{u'\sqrt{-3}}.$
+* Let $x$ be the first element of $(u' + 4Y^2, \frac{-X}{2Y} - \frac{u'}{2}, \frac{X}{2Y} - \frac{u'}{2})$ for which $g(x)$ is square.
+* Let $y = \sqrt{g(x)}.$
+* Return $(x, y)$ if $sign(y) = sign(t)$; $(x, -y)$ otherwise.
+
+This is implemented in `secp256k1_ellswift_swiftec_var`. The used $sign(x)$ function is the parity of $x$ when represented as an integer in $[0,q).$
+
+The corresponding encoder would invoke the x-only one, but negating the output $t$ if $sign(t) \neq sign(y).$
+
+This is implemented in `secp256k1_ellswift_elligatorswift_var`.
+
+Note that this is only intended for encoding points where both the x-coordinate and y-coordinate are unpredictable. When encoding x-only points
+where the y-coordinate is implicitly even (or implicitly square, or implicitly in $[0,q/2]$), the encoder in
+[Section 3.5](#35-encoding-for-secp256k1) must be used, or a bias is reintroduced that undoes all the benefit of using ElligatorSwift
+in the first place.
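
The full-point decoder from Section 4.1 can be sketched in Python as follows. This is a minimal illustration, not the library's implementation: the names are hypothetical, $sign$ is taken to be the parity of the integer representative in $[0,q)$ as described above, and the root of $-3$ is computed on the fly, so it may differ from the one used in the C code.

```python
P = 2**256 - 2**32 - 977   # secp256k1 field size
B = 7                      # secp256k1 curve constant b

def g(x):
    return (pow(x, 3, P) + B) % P

def sqrt(a):
    """Return a square root of a, or None if a is not a square (P = 3 mod 4)."""
    a %= P
    r = pow(a, (P + 1) // 4, P)
    return r if r * r % P == a else None

def inv(a):
    return pow(a, P - 2, P)

SQRT_M3 = sqrt(-3)   # assumption: not checked against the library's constant

def ellswift_decode(u, t):
    """Decode (u, t) to a point (x, y); the parity of t selects the sign of y."""
    u %= P
    t %= P
    u1 = u if u != 0 else 1              # u'
    t1 = t if t != 0 else 1              # t'
    if (g(u1) + t1 * t1) % P == 0:       # t'' = 2t' when u'^3 + b + t'^2 = 0
        t1 = 2 * t1 % P
    X = (g(u1) - t1 * t1) * inv(2 * t1) % P
    Y = (X + t1) * inv(u1 * SQRT_M3) % P
    q = X * inv(2 * Y) % P
    half_u = u1 * inv(2) % P
    for x in ((u1 + 4 * Y * Y) % P, (-q - half_u) % P, (q - half_u) % P):
        y = sqrt(g(x))
        if y is not None:
            break
    else:
        raise AssertionError("unreachable: one candidate is always on the curve")
    # sign(y) = sign(t), with sign taken as parity of the value in [0, P).
    if y % 2 != t % 2:
        y = P - y
    return x, y
```

Negating a non-zero $t$ negates both $X$ and $Y$, which leaves all three x-candidates unchanged and flips the parity of $t$, so `ellswift_decode(u, P - t)` returns the same `x` with `y` negated.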
--- a/secp256k1/doc/musig.md
+++ b/secp256k1/doc/musig.md
@@ -0,0 +1,54 @@
+Notes on the musig module API
+===========================
+
+The following sections contain additional notes on the API of the musig module (`include/secp256k1_musig.h`).
+A usage example can be found in `examples/musig.c`.
+
+## API misuse
+
+The musig API is designed with a focus on misuse resistance.
+However, due to the interactive nature of the MuSig protocol, there are additional failure modes that are not present in regular (single-party) Schnorr signature creation.
+While the results can be catastrophic (e.g. leaking of the secret key), it is unfortunately not possible for the musig implementation to prevent all such failure modes.
+
+Therefore, users of the musig module must take great care to make sure of the following:
+
+1. A unique nonce per signing session is generated in `secp256k1_musig_nonce_gen`.
+   See the corresponding comment in `include/secp256k1_musig.h` for how to ensure that.
+2. The `secp256k1_musig_secnonce` structure is never copied or serialized.
+   See also the comment on `secp256k1_musig_secnonce` in `include/secp256k1_musig.h`.
+3. Opaque data structures are never written to or read from directly.
+   Instead, only the provided accessor functions are used.
+
+## Key Aggregation and (Taproot) Tweaking
+
+Given a set of public keys, the aggregate public key is computed with `secp256k1_musig_pubkey_agg`.
+A plain tweak can be added to the resulting public key with `secp256k1_ec_pubkey_tweak_add` by setting the `tweak32` argument to the hash defined in BIP 32. Similarly, a Taproot tweak can be added with `secp256k1_xonly_pubkey_tweak_add` by setting the `tweak32` argument to the TapTweak hash defined in BIP 341.
+Both types of tweaking can be combined and invoked multiple times if the specific application requires it.
+
+## Signing
+
+This is covered by `examples/musig.c`.
+Essentially, the protocol proceeds in the following steps:
+
+1. Generate a keypair with `secp256k1_keypair_create` and obtain the public key with `secp256k1_keypair_pub`.
+2. Call `secp256k1_musig_pubkey_agg` with the pubkeys of all participants.
+3. Optionally add a (Taproot) tweak with `secp256k1_musig_pubkey_xonly_tweak_add` and a plain tweak with `secp256k1_musig_pubkey_ec_tweak_add`.
+4. Generate a pair of secret and public nonce with `secp256k1_musig_nonce_gen` and send the public nonce to the other signers.
+5. Someone (not necessarily the signer) aggregates the public nonces with `secp256k1_musig_nonce_agg` and sends it to the signers.
+6. Process the aggregate nonce with `secp256k1_musig_nonce_process`.
+7. Create a partial signature with `secp256k1_musig_partial_sign`.
+8. Verify the partial signatures (optional in some scenarios) with `secp256k1_musig_partial_sig_verify`.
+9. Someone (not necessarily the signer) obtains all partial signatures and aggregates them into the final Schnorr signature using `secp256k1_musig_partial_sig_agg`.
+
+The aggregate signature can be verified with `secp256k1_schnorrsig_verify`.
+
+Steps 1 through 5 above can occur before or after the signers are aware of the message to be signed.
+Whenever possible, it is recommended to generate the nonces only after the message is known.
+This provides enhanced defense-in-depth measures, protecting against potential API misuse in certain scenarios.
+However, it does require two rounds of communication during the signing process.
+The alternative, generating the nonces in a pre-processing step before the message is known, eliminates these additional protective measures but allows for non-interactive signing.
+Similarly, the API supports an alternative protocol flow where generating the aggregate key (steps 1 to 3) is allowed to happen after exchanging nonces (steps 4 to 5).
+
+## Verification
+
+A participant who wants to verify the partial signatures but does not sign itself may do so by following the above instructions, except that the verifier skips steps 1, 4 and 7.
--- a/secp256k1/doc/release-process.md
+++ b/secp256k1/doc/release-process.md
@@ -0,0 +1,94 @@
+# Release process
+
+This document outlines the process for releasing versions of the form `$MAJOR.$MINOR.$PATCH`.
+
+We distinguish between two types of releases: *regular* and *maintenance* releases.
+Regular releases are releases of a new major or minor version as well as patches of the most recent release.
+Maintenance releases, on the other hand, are required for patches of older releases.
+
+You should coordinate with the other maintainers on the release date, if possible.
+This date will be part of the release entry in [CHANGELOG.md](../CHANGELOG.md) and it should match the dates of the remaining steps in the release process (including the date of the tag and the GitHub release).
+It is best if the maintainers are present during the release, so they can help ensure that the process is followed correctly and, in the case of a regular release, they are aware that they should not modify the master branch between merging the PR in step 1 and the PR in step 3.
+
+This process also assumes that there will be no minor releases for old major releases.
+
+We aim to cut a regular release every 3-4 months, approximately twice as frequently as major Bitcoin Core releases. Every second release should be published one month before the feature freeze of the next major Bitcoin Core release, allowing sufficient time to update the library in Core.
+
+## Sanity checks
+Perform these checks when reviewing the release PR (see below):
+
+1. Ensure `make distcheck` doesn't fail.
+   ```shell
+   ./autogen.sh && ./configure --enable-dev-mode && make distcheck
+   ```
+2. Check installation with autotools:
+   ```shell
+   dir=$(mktemp -d)
+   ./autogen.sh && ./configure --prefix=$dir && make clean && make install && ls -RlAh $dir
+   gcc -o ecdsa examples/ecdsa.c $(PKG_CONFIG_PATH=$dir/lib/pkgconfig pkg-config --cflags --libs libsecp256k1) -Wl,-rpath,"$dir/lib" && ./ecdsa
+   ```
+3. Check installation with CMake:
+   ```shell
+   dir=$(mktemp -d)
+   build=$(mktemp -d)
+   cmake -B $build -DCMAKE_INSTALL_PREFIX=$dir && cmake --build $build && cmake --install $build && ls -RlAh $dir
+   gcc -o ecdsa examples/ecdsa.c -I $dir/include -L $dir/lib*/ -l secp256k1 -Wl,-rpath,"$dir/lib",-rpath,"$dir/lib64" && ./ecdsa
+   ```
+4. Use the [`check-abi.sh`](/tools/check-abi.sh) tool to verify that there are no unexpected ABI incompatibilities and that the version number and the release notes accurately reflect all potential ABI changes. To run this tool, the `abi-dumper` and `abi-compliance-checker` packages are required.
+   ```shell
+   tools/check-abi.sh
+   ```
+
+## Regular release
+
+1. Open a PR to the master branch with a commit (using message `"release: prepare for $MAJOR.$MINOR.$PATCH"`, for example) that
+   * finalizes the release notes in [CHANGELOG.md](../CHANGELOG.md) by
+       * adding a section for the release (make sure that the version number is a link to a diff between the previous and new version),
+       * removing the `[Unreleased]` section header,
+       * ensuring that the release notes are not missing entries (check the `needs-changelog` label on GitHub), and
+       * including an entry for `### ABI Compatibility` if it doesn't exist,
+   * sets `_PKG_VERSION_IS_RELEASE` to `true` in `configure.ac`, and,
+   * if this is not a patch release,
+       * updates `_PKG_VERSION_*` and `_LIB_VERSION_*` in `configure.ac`, and
+       * updates `project(libsecp256k1 VERSION ...)` and `${PROJECT_NAME}_LIB_VERSION_*` in `CMakeLists.txt`.
+2. Perform the [sanity checks](#sanity-checks) on the PR branch.
+3. After the PR is merged, tag the commit, and push the tag:
+   ```
+   RELEASE_COMMIT=<merge commit of step 1>
+   git tag -s v$MAJOR.$MINOR.$PATCH -m "libsecp256k1 $MAJOR.$MINOR.$PATCH" $RELEASE_COMMIT
+   git push git@github.com:bitcoin-core/secp256k1.git v$MAJOR.$MINOR.$PATCH
+   ```
+4. Open a PR to the master branch with a commit (using message `"release cleanup: bump version after $MAJOR.$MINOR.$PATCH"`, for example) that
+   * sets `_PKG_VERSION_IS_RELEASE` to `false` and increments `_PKG_VERSION_PATCH` and `_LIB_VERSION_REVISION` in `configure.ac`,
+   * increments the `$PATCH` component of `project(libsecp256k1 VERSION ...)` and `${PROJECT_NAME}_LIB_VERSION_REVISION` in `CMakeLists.txt`, and
+   * adds an `[Unreleased]` section header to the [CHANGELOG.md](../CHANGELOG.md).
+
+   If other maintainers are not present to approve the PR, it can be merged without ACKs.
+5. Create a new GitHub release with a link to the corresponding entry in [CHANGELOG.md](../CHANGELOG.md).
+6. Send an announcement email to the bitcoin-dev mailing list.
+
+## Maintenance release
+
+Note that bug fixes need to be backported only to releases for which no compatible release without the bug exists.
+
+1. If there's no maintenance branch `$MAJOR.$MINOR`, create one:
+   ```
+   git checkout -b $MAJOR.$MINOR v$MAJOR.$MINOR.$((PATCH - 1))
+   git push git@github.com:bitcoin-core/secp256k1.git $MAJOR.$MINOR
+   ```
+2. Open a pull request to the `$MAJOR.$MINOR` branch that
+   * includes the bug fixes,
+   * finalizes the release notes similar to a regular release,
+   * increments `_PKG_VERSION_PATCH` and `_LIB_VERSION_REVISION` in `configure.ac`
+     and the `$PATCH` component of `project(libsecp256k1 VERSION ...)` and `${PROJECT_NAME}_LIB_VERSION_REVISION` in `CMakeLists.txt`
+     (with commit message `"release: bump versions for $MAJOR.$MINOR.$PATCH"`, for example).
+3. Perform the [sanity checks](#sanity-checks) on the PR branch.
+4. After the PRs are merged, update the release branch, tag the commit, and push the tag:
+   ```
+   git checkout $MAJOR.$MINOR && git pull
+   git tag -s v$MAJOR.$MINOR.$PATCH -m "libsecp256k1 $MAJOR.$MINOR.$PATCH"
+   git push git@github.com:bitcoin-core/secp256k1.git v$MAJOR.$MINOR.$PATCH
+   ```
+5. Create a new GitHub release with a link to the corresponding entry in [CHANGELOG.md](../CHANGELOG.md).
+6. Send an announcement email to the bitcoin-dev mailing list.
+7. Open PR to the master branch that includes a commit (with commit message `"release notes: add $MAJOR.$MINOR.$PATCH"`, for example) that adds release notes to [CHANGELOG.md](../CHANGELOG.md).
--- a/secp256k1/doc/safegcd_implementation.md
+++ b/secp256k1/doc/safegcd_implementation.md
@@ -0,0 +1,819 @@
+# The safegcd implementation in libsecp256k1 explained
+
+This document explains the modular inverse and Jacobi symbol implementations in the `src/modinv*.h` files.
+It is based on the paper
+["Fast constant-time gcd computation and modular inversion"](https://gcd.cr.yp.to/papers.html#safegcd)
+by Daniel J. Bernstein and Bo-Yin Yang. The references below are to the version dated 2019.04.13.
+
+The actual implementation is in C of course, but for demonstration purposes Python3 is used here.
+Most implementation aspects and optimizations are explained, except those that depend on the specific
+number representation used in the C code.
+
+## 1. Computing the Greatest Common Divisor (GCD) using divsteps
+
+The algorithm from the paper (section 11), at a very high level, is this:
+
+```python
+def gcd(f, g):
+    """Compute the GCD of an odd integer f and another integer g."""
+    assert f & 1  # require f to be odd
+    delta = 1     # additional state variable
+    while g != 0:
+        assert f & 1  # f will be odd in every iteration
+        if delta > 0 and g & 1:
+            delta, f, g = 1 - delta, g, (g - f) // 2
+        elif g & 1:
+            delta, f, g = 1 + delta, f, (g + f) // 2
+        else:
+            delta, f, g = 1 + delta, f, (g    ) // 2
+    return abs(f)
+```
+
+It computes the greatest common divisor of an odd integer *f* and any integer *g*. Its inner loop
+keeps rewriting the variables *f* and *g* alongside a state variable *&delta;* that starts at *1*, until
+*g=0* is reached. At that point, *|f|* gives the GCD. Each of the transitions in the loop is called a
+"division step" (referred to as divstep in what follows).
+
+For example, *gcd(21, 14)* would be computed as:
+- Start with *&delta;=1 f=21 g=14*
+- Take the third branch: *&delta;=2 f=21 g=7*
+- Take the first branch: *&delta;=-1 f=7 g=-7*
+- Take the second branch: *&delta;=0 f=7 g=0*
+- The answer *|f| = 7*.
+
+Why it works:
+- Divsteps can be decomposed into two steps (see paragraph 8.2 in the paper):
+  - (a) If *g* is odd, replace *(f,g)* with *(g,g-f)* or *(f,g+f)*, resulting in an even *g*.
+  - (b) Replace *(f,g)* with *(f,g/2)* (where *g* is guaranteed to be even).
+- Neither of those two operations changes the GCD:
+  - For (a), assume *gcd(f,g)=c*, then it must be the case that *f=a&thinsp;c* and *g=b&thinsp;c* for some integers *a*
+    and *b*. As *(g,g-f)=(b&thinsp;c,(b-a)c)* and *(f,f+g)=(a&thinsp;c,(a+b)c)*, the result clearly still has
+    common factor *c*. Reasoning in the other direction shows that no common factor can be added by
+    doing so either.
+  - For (b), we know that *f* is odd, so *gcd(f,g)* clearly has no factor *2*, and we can remove
+    it from *g*.
+- The algorithm will eventually converge to *g=0*. This is proven in the paper (see theorem G.3).
+- It follows that eventually we find a final value *f'* for which *gcd(f,g) = gcd(f',0)*. As the
+  gcd of *f'* and *0* is *|f'|* by definition, that is our answer.
+
+Compared to more [traditional GCD algorithms](https://en.wikipedia.org/wiki/Euclidean_algorithm), this one has the property of only ever looking at
+the low-order bits of the variables to decide the next steps, and being easy to make
+constant-time (in more low-level languages than Python). The *&delta;* parameter is necessary to
+guide the algorithm towards shrinking the numbers' magnitudes without explicitly needing to look
+at high order bits.
+
+Properties that will become important later:
+- Performing more divsteps than needed is not a problem, as *f* does not change anymore after *g=0*.
+- Only even numbers are divided by *2*. This means that when reasoning about it algebraically we
+  do not need to worry about rounding.
+- At every point during the algorithm's execution the next *N* steps only depend on the bottom *N*
+  bits of *f* and *g*, and on *&delta;*.
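+
+The last property can be checked empirically. The following is a minimal sketch, not part of the C implementation; the helper names `divstep`, `branches`, `low_bits_suffice`, and `random_check` are introduced here for illustration only. It records which branch each divstep takes and confirms that truncating *f* and *g* to their bottom *n* bits leaves the next *n* branch decisions unchanged.
+
```python
import random

def divstep(delta, f, g):
    """Perform one divstep; also return the index (0, 1, 2) of the branch taken."""
    if delta > 0 and g & 1:
        return 1 - delta, g, (g - f) // 2, 0
    elif g & 1:
        return 1 + delta, f, (g + f) // 2, 1
    else:
        return 1 + delta, f, g // 2, 2

def branches(delta, f, g, n):
    """List the branches taken by the next n divsteps."""
    taken = []
    for _ in range(n):
        delta, f, g, branch = divstep(delta, f, g)
        taken.append(branch)
    return taken

def low_bits_suffice(delta, f, g, n):
    """Check that the next n branches are unchanged when f and g are cut to n bits."""
    mask = (1 << n) - 1
    return branches(delta, f, g, n) == branches(delta, f & mask, g & mask, n)

def random_check(trials, n):
    """Run low_bits_suffice on random 256-bit inputs (with f forced odd)."""
    for _ in range(trials):
        f = random.randrange(-2**256, 2**256) | 1
        g = random.randrange(-2**256, 2**256)
        if not low_bits_suffice(1, f, g, n):
            return False
    return True
```
+
+Here `branches(1, 21, 14, 3)` follows the same third, first, second branch sequence as the *gcd(21, 14)* example above.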
+
+
+## 2. From GCDs to modular inverses
+
+We want an algorithm to compute the inverse *a* of *x* modulo *M*, i.e. the number *a* such that *a&thinsp;x=1
+mod M*. This inverse only exists if the GCD of *x* and *M* is *1*, but that is always the case if *M* is
+prime and *0 < x < M*. In what follows, assume that the modular inverse exists.
+It turns out this inverse can be computed as a side effect of computing the GCD by keeping track
+of how the internal variables can be written as linear combinations of the inputs at every step
+(see the [extended Euclidean algorithm](https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm)).
+Since the GCD is *1*, such an algorithm will compute numbers *a* and *b* such that *a&thinsp;x + b&thinsp;M = 1*.
+Taking that expression *mod M* gives *a&thinsp;x mod M = 1*, and we see that *a* is the modular inverse of *x
+mod M*.
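+
+As a point of comparison, the extended Euclidean approach described above can be sketched as follows. This is an illustrative sketch only: the names `egcd` and `modinv_euclid` are hypothetical, and this is not how libsecp256k1 computes inverses. Each row *(a, b, r)* maintains the invariant *a&thinsp;x + b&thinsp;M = r*.
+
```python
def egcd(x, M):
    """Return (g, a, b) with a*x + b*M == g == gcd(x, M), for x >= 0 and M > 0."""
    a0, b0, r0 = 1, 0, x  # invariant: a0*x + b0*M == r0
    a1, b1, r1 = 0, 1, M  # invariant: a1*x + b1*M == r1
    while r1 != 0:
        k = r0 // r1
        # Subtract k times the second row from the first, then swap the rows.
        a0, b0, r0, a1, b1, r1 = a1, b1, r1, a0 - k*a1, b0 - k*b1, r0 - k*r1
    return r0, a0, b0

def modinv_euclid(M, x):
    """Compute the inverse of x mod M, assuming gcd(x, M) == 1."""
    g, a, b = egcd(x % M, M)
    assert g == 1                     # the inverse only exists if the GCD is 1
    assert a * (x % M) + b * M == 1   # taking this mod M gives a*x = 1 (mod M)
    return a % M
```
+
+Unlike the divsteps-based algorithm, this version needs a full-size division in every iteration, which is what makes it hard to run in constant time.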
+
+A similar approach can be used to calculate modular inverses using the divsteps-based GCD
+algorithm shown above, if the modulus *M* is odd. To do so, compute *gcd(f=M,g=x)*, while keeping
+track of extra variables *d* and *e*, for which at every step *d = f/x (mod M)* and *e = g/x (mod M)*.
+*f/x* here means the number which multiplied with *x* gives *f mod M*. As *f* and *g* are initialized to *M*
+and *x* respectively, *d* and *e* just start off being *0* (*M/x mod M = 0/x mod M = 0*) and *1* (*x/x mod M
+= 1*).
+
+```python
+def div2(M, x):
+    """Helper routine to compute x/2 mod M (where M is odd)."""
+    assert M & 1
+    if x & 1: # If x is odd, make it even by adding M.
+        x += M
+    # x must be even now, so a clean division by 2 is possible.
+    return x // 2
+
+def modinv(M, x):
+    """Compute the inverse of x mod M (given that it exists, and M is odd)."""
+    assert M & 1
+    delta, f, g, d, e = 1, M, x, 0, 1
+    while g != 0:
+        # Note that while division by two for f and g is only ever done on even inputs, this is
+        # not true for d and e, so we need the div2 helper function.
+        if delta > 0 and g & 1:
+            delta, f, g, d, e = 1 - delta, g, (g - f) // 2, e, div2(M, e - d)
+        elif g & 1:
+            delta, f, g, d, e = 1 + delta, f, (g + f) // 2, d, div2(M, e + d)
+        else:
+            delta, f, g, d, e = 1 + delta, f, (g    ) // 2, d, div2(M, e    )
+        # Verify that the invariants d=f/x mod M, e=g/x mod M are maintained.
+        assert f % M == (d * x) % M
+        assert g % M == (e * x) % M
+    assert f == 1 or f == -1  # |f| is the GCD, it must be 1
+    # Because of invariant d = f/x (mod M), 1/x = d/f (mod M). As |f|=1, d/f = d*f.
+    return (d * f) % M
+```
+
+Also note that this approach to track *d* and *e* throughout the computation to determine the inverse
+is different from the paper. There (see paragraph 12.1 in the paper) a transition matrix for the
+entire computation is determined (see section 3 below) and the inverse is computed from that.
+The approach here avoids the need for 2x2 matrix multiplications of various sizes, and appears to
+be faster at the level of optimization we're able to do in C.
+
+
+## 3. Batching multiple divsteps
+
+Every divstep can be expressed as a matrix multiplication, applying a transition matrix *(1/2 t)*
+to both vectors *[f, g]* and *[d, e]* (see paragraph 8.1 in the paper):
+
+```
+  t = [ u,  v ]
+      [ q,  r ]
+
+  [ out_f ] = (1/2 * t) * [ in_f ]
+  [ out_g ]               [ in_g ]
+
+  [ out_d ] = (1/2 * t) * [ in_d ]  (mod M)
+  [ out_e ]               [ in_e ]
+```
+
+where *(u, v, q, r)* is *(0, 2, -1, 1)*, *(2, 0, 1, 1)*, or *(2, 0, 0, 1)*, depending on which branch is
+taken. As above, the resulting *f* and *g* are always integers.
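+
+The three matrices can be checked against the divstep definition directly. The sketch below uses hypothetical helper names (`MATRICES`, `divstep`, `apply_half`, `check_matrices`) and random test inputs chosen purely for illustration.
+
```python
import random

# Transition matrices (u, v, q, r) for the first, second, and third branch.
MATRICES = [(0, 2, -1, 1), (2, 0, 1, 1), (2, 0, 0, 1)]

def divstep(delta, f, g):
    """Perform one divstep; also return the index of the branch taken."""
    if delta > 0 and g & 1:
        return 1 - delta, g, (g - f) // 2, 0
    elif g & 1:
        return 1 + delta, f, (g + f) // 2, 1
    else:
        return 1 + delta, f, g // 2, 2

def apply_half(t, a, b):
    """Apply (1/2 * t) to the integer vector [a, b]; the division must be exact."""
    u, v, q, r = t
    ca, cb = u*a + v*b, q*a + r*b
    assert ca % 2 == 0 and cb % 2 == 0
    return ca // 2, cb // 2

def check_matrices(trials):
    """Compare the matrix form with the branch form on random inputs (f odd)."""
    for _ in range(trials):
        delta = random.randrange(-5, 6)
        f = random.randrange(-2**64, 2**64) | 1
        g = random.randrange(-2**64, 2**64)
        _, new_f, new_g, branch = divstep(delta, f, g)
        if apply_half(MATRICES[branch], f, g) != (new_f, new_g):
            return False
    return True
```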
+
+Performing multiple divsteps corresponds to a multiplication with the product of all the
+individual divsteps' transition matrices. As each transition matrix consists of integers
+divided by *2*, the product of these matrices will consist of integers divided by *2<sup>N</sup>* (see also
+theorem 9.2 in the paper). These divisions are expensive when updating *d* and *e*, so we delay
+them: we compute the integer coefficients of the combined transition matrix scaled by *2<sup>N</sup>*, and
+do one division by *2<sup>N</sup>* as a final step:
+
+```python
+def divsteps_n_matrix(delta, f, g):
+    """Compute delta and transition matrix t after N divsteps (multiplied by 2^N)."""
+    u, v, q, r = 1, 0, 0, 1 # start with identity matrix
+    for _ in range(N):
+        if delta > 0 and g & 1:
+            delta, f, g, u, v, q, r = 1 - delta, g, (g - f) // 2, 2*q, 2*r, q-u, r-v
+        elif g & 1:
+            delta, f, g, u, v, q, r = 1 + delta, f, (g + f) // 2, 2*u, 2*v, q+u, r+v
+        else:
+            delta, f, g, u, v, q, r = 1 + delta, f, (g    ) // 2, 2*u, 2*v, q  , r
+    return delta, (u, v, q, r)
+```
+
+As the branches in the divsteps are completely determined by the bottom *N* bits of *f* and *g*, this
+function to compute the transition matrix only needs to see those bottom bits. Furthermore all
+intermediate results and outputs fit in *(N+1)*-bit numbers (unsigned for *f* and *g*; signed for *u*, *v*,
+*q*, and *r*) (see also paragraph 8.3 in the paper). This means that an implementation using 64-bit
+integers could set *N=62* and compute the full transition matrix for 62 steps at once without any
+big integer arithmetic at all. This is the reason why this algorithm is efficient: it only needs
+to update the full-size *f*, *g*, *d*, and *e* numbers once every *N* steps.
+
+We still need functions to compute:
+
+```
+  [ out_f ] = (1/2^N * [ u,  v ]) * [ in_f ]
+  [ out_g ]   (        [ q,  r ])   [ in_g ]
+
+  [ out_d ] = (1/2^N * [ u,  v ]) * [ in_d ]  (mod M)
+  [ out_e ]   (        [ q,  r ])   [ in_e ]
+```
+
+Because the divsteps transformation only ever divides even numbers by two, the result of *t&thinsp;[f,g]* is always even. When *t* is a composition of *N* divsteps, it follows that the resulting *f*
+and *g* will be multiples of *2<sup>N</sup>*, and division by *2<sup>N</sup>* is simply shifting them down:
+
+```python
+def update_fg(f, g, t):
+    """Multiply matrix t/2^N with [f, g]."""
+    u, v, q, r = t
+    cf, cg = u*f + v*g, q*f + r*g
+    # (t / 2^N) should cleanly apply to [f,g] so the result of t*[f,g] should have N zero
+    # bottom bits.
+    assert cf % 2**N == 0
+    assert cg % 2**N == 0
+    return cf >> N, cg >> N
+```
+
+The same is not true for *d* and *e*, and we need an equivalent of the `div2` function for division by *2<sup>N</sup> mod M*.
+This is easy if we have precomputed *1/M mod 2<sup>N</sup>* (which always exists for odd *M*):
+
+```python
+def div2n(M, Mi, x):
+    """Compute x/2^N mod M, given Mi = 1/M mod 2^N."""
+    assert (M * Mi) % 2**N == 1
+    # Find a factor m such that m*M has the same bottom N bits as x. We want:
+    #     (m * M) mod 2^N = x mod 2^N
+    # <=> m mod 2^N = (x / M) mod 2^N
+    # <=> m mod 2^N = (x * Mi) mod 2^N
+    m = (Mi * x) % 2**N
+    # Subtract that multiple from x, cancelling its bottom N bits.
+    x -= m * M
+    # Now a clean division by 2^N is possible.
+    assert x % 2**N == 0
+    return (x >> N) % M
+
+def update_de(d, e, t, M, Mi):
+    """Multiply matrix t/2^N with [d, e], modulo M."""
+    u, v, q, r = t
+    cd, ce = u*d + v*e, q*d + r*e
+    return div2n(M, Mi, cd), div2n(M, Mi, ce)
+```
+
+With all of those, we can write a version of `modinv` that performs *N* divsteps at once:
+
+```python
+def modinv(M, Mi, x):
+    """Compute the modular inverse of x mod M, given Mi=1/M mod 2^N."""
+    assert M & 1
+    delta, f, g, d, e = 1, M, x, 0, 1
+    while g != 0:
+        # Compute the delta and transition matrix t for the next N divsteps (this only needs
+        # (N+1)-bit signed integer arithmetic).
+        delta, t = divsteps_n_matrix(delta, f % 2**N, g % 2**N)
+        # Apply the transition matrix t to [f, g]:
+        f, g = update_fg(f, g, t)
+        # Apply the transition matrix t to [d, e]:
+        d, e = update_de(d, e, t, M, Mi)
+    return (d * f) % M
+```
+
+This means that in practice we'll always perform a multiple of *N* divsteps. This is not a problem
+because once *g=0*, further divsteps do not affect *f*, *g*, *d*, or *e* anymore (only *&delta;* keeps
+increasing). For variable time code such excess iterations will be mostly optimized away in later
+sections.
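+
+To try the batched algorithm end to end, the pieces above can be combined into one self-contained script. This is a sketch under stated assumptions: *N=62* is just an example batch size, the matrix application is inlined instead of calling `update_fg` and `update_de`, and *1/M mod 2<sup>N</sup>* is computed inside `modinv` with Python's three-argument `pow` (negative exponents require Python 3.8 or later) rather than being passed in precomputed.
+
```python
N = 62  # divsteps per batch (example value)

def divsteps_n_matrix(delta, f, g):
    """Compute delta and transition matrix t after N divsteps (multiplied by 2^N)."""
    u, v, q, r = 1, 0, 0, 1  # start with identity matrix
    for _ in range(N):
        if delta > 0 and g & 1:
            delta, f, g, u, v, q, r = 1 - delta, g, (g - f) // 2, 2*q, 2*r, q-u, r-v
        elif g & 1:
            delta, f, g, u, v, q, r = 1 + delta, f, (g + f) // 2, 2*u, 2*v, q+u, r+v
        else:
            delta, f, g, u, v, q, r = 1 + delta, f, g // 2, 2*u, 2*v, q, r
    return delta, (u, v, q, r)

def div2n(M, Mi, x):
    """Compute x/2^N mod M, given Mi = 1/M mod 2^N."""
    m = (Mi * x) % 2**N
    x -= m * M               # cancels the bottom N bits of x
    assert x % 2**N == 0
    return (x >> N) % M

def modinv(M, x):
    """Compute the inverse of x mod M (M odd, gcd(x, M) = 1), N divsteps at a time."""
    assert M & 1
    Mi = pow(M, -1, 2**N)    # assumption: computed here instead of precomputed
    delta, f, g, d, e = 1, M, x, 0, 1
    while g != 0:
        # Only the bottom N bits of f and g are needed to find the matrix.
        delta, (u, v, q, r) = divsteps_n_matrix(delta, f % 2**N, g % 2**N)
        f, g = (u*f + v*g) >> N, (q*f + r*g) >> N
        d, e = div2n(M, Mi, u*d + v*e), div2n(M, Mi, q*d + r*e)
    assert f == 1 or f == -1
    return (d * f) % M
```
+
+For example, `modinv(17, 5)` can be compared against `pow(5, -1, 17)`, and the same comparison works for the secp256k1 field prime.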
+
+
+## 4. Avoiding modulus operations
+
+So far, there are two places where we compute a remainder of big numbers modulo *M*: at the end of
+`div2n` in every `update_de`, and at the very end of `modinv` after potentially negating *d* due to the
+sign of *f*. These are relatively expensive operations when done generically.
+
+To deal with the modulus operation in `div2n`, we simply stop requiring *d* and *e* to be in range
+*[0,M)* all the time. Let's start by inlining `div2n` into `update_de`, and dropping the modulus
+operation at the end:
+
+```python
+def update_de(d, e, t, M, Mi):
+    """Multiply matrix t/2^N with [d, e] mod M, given Mi=1/M mod 2^N."""
+    u, v, q, r = t
+    cd, ce = u*d + v*e, q*d + r*e
+    # Cancel out bottom N bits of cd and ce.
+    md = -((Mi * cd) % 2**N)
+    me = -((Mi * ce) % 2**N)
+    cd += md * M
+    ce += me * M
+    # And cleanly divide by 2**N.
+    return cd >> N, ce >> N
+```
+
+Let's look at bounds on the ranges of these numbers. It can be shown that *|u|+|v|* and *|q|+|r|*
+never exceed *2<sup>N</sup>* (see paragraph 8.3 in the paper), and thus a multiplication with *t* will have
+outputs whose absolute values are at most *2<sup>N</sup>* times the maximum absolute input value. In case the
+inputs *d* and *e* are in *(-M,M)*, which is certainly true for the initial values *d=0* and *e=1* assuming
+*M > 1*, the multiplication results in numbers in range *(-2<sup>N</sup>M,2<sup>N</sup>M)*. Subtracting less than *2<sup>N</sup>*
+times *M* to cancel out *N* bits brings that up to *(-2<sup>N+1</sup>M,2<sup>N</sup>M)*, and
+dividing by *2<sup>N</sup>* at the end takes it to *(-2M,M)*. Another application of `update_de` would take that
+to *(-3M,2M)*, and so forth. This progressive expansion of the variables' ranges can be
+counteracted by incrementing *d* and *e* by *M* whenever they're negative:
+
+```python
+    ...
+    if d < 0:
+        d += M
+    if e < 0:
+        e += M
+    cd, ce = u*d + v*e, q*d + r*e
+    # Cancel out bottom N bits of cd and ce.
+    ...
+```
+
+With inputs in *(-2M,M)*, they will first be shifted into range *(-M,M)*, which means that the
+output will again be in *(-2M,M)*, and this remains the case regardless of how many `update_de`
+invocations there are. In what follows, we will try to make this more efficient.
+
+Note that increasing *d* by *M* is equal to incrementing *cd* by *u&thinsp;M* and *ce* by *q&thinsp;M*. Similarly,
+increasing *e* by *M* is equal to incrementing *cd* by *v&thinsp;M* and *ce* by *r&thinsp;M*. So we could instead write:
+
+```python
+    ...
+    cd, ce = u*d + v*e, q*d + r*e
+    # Perform the equivalent of incrementing d, e by M when they're negative.
+    if d < 0:
+        cd += u*M
+        ce += q*M
+    if e < 0:
+        cd += v*M
+        ce += r*M
+    # Cancel out bottom N bits of cd and ce.
+    md = -((Mi * cd) % 2**N)
+    me = -((Mi * ce) % 2**N)
+    cd += md * M
+    ce += me * M
+    ...
+```
+
+Now note that we have two steps of corrections to *cd* and *ce* that add multiples of *M*: this
+increment, and the decrement that cancels out bottom bits. The second one depends on the first
+one, but they can still be efficiently combined by only computing the bottom bits of *cd* and *ce*
+at first, and using that to compute the final *md*, *me* values:
+
+```python
+def update_de(d, e, t, M, Mi):
+    """Multiply matrix t/2^N with [d, e], modulo M."""
+    u, v, q, r = t
+    md, me = 0, 0
+    # Compute what multiples of M to add to cd and ce.
+    if d < 0:
+        md += u
+        me += q
+    if e < 0:
+        md += v
+        me += r
+    # Compute bottom N bits of t*[d,e] + M*[md,me].
+    cd, ce = (u*d + v*e + md*M) % 2**N, (q*d + r*e + me*M) % 2**N
+    # Correct md and me such that the bottom N bits of t*[d,e] + M*[md,me] are zero.
+    md -= (Mi * cd) % 2**N
+    me -= (Mi * ce) % 2**N
+    # Do the full computation.
+    cd, ce = u*d + v*e + md*M, q*d + r*e + me*M
+    # And cleanly divide by 2**N.
+    return cd >> N, ce >> N
+```
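+
+The range argument can be spot-checked numerically. The sketch below repeats the `update_de` function above and feeds it random inputs in *(-2M,M)*. The matrices are an assumption made for testing: they are random integer matrices that satisfy only the *|u|+|v| &le; 2<sup>N</sup>* and *|q|+|r| &le; 2<sup>N</sup>* bounds, not necessarily products of real divsteps. `random_row` and `check_bounds` are hypothetical helper names, and *N=62* is an example value.
+
```python
import random

N = 62  # example batch size

def update_de(d, e, t, M, Mi):
    """Multiply matrix t/2^N with [d, e], modulo M (combined-correction version)."""
    u, v, q, r = t
    md, me = 0, 0
    if d < 0:
        md += u
        me += q
    if e < 0:
        md += v
        me += r
    cd, ce = (u*d + v*e + md*M) % 2**N, (q*d + r*e + me*M) % 2**N
    md -= (Mi * cd) % 2**N
    me -= (Mi * ce) % 2**N
    cd, ce = u*d + v*e + md*M, q*d + r*e + me*M
    assert cd % 2**N == 0 and ce % 2**N == 0
    return cd >> N, ce >> N

def random_row():
    """Random pair (a, b) with |a| + |b| <= 2^N."""
    a = random.randint(-2**N, 2**N)
    b = random.randint(-(2**N - abs(a)), 2**N - abs(a))
    return a, b

def check_bounds(M, trials):
    """Check that outputs stay in (-2M, M) and equal t*[d, e]/2^N modulo M."""
    Mi = pow(M, -1, 2**N)  # requires Python 3.8+
    for _ in range(trials):
        (u, v), (q, r) = random_row(), random_row()
        d = random.randint(-2*M + 1, M - 1)
        e = random.randint(-2*M + 1, M - 1)
        nd, ne = update_de(d, e, (u, v, q, r), M, Mi)
        if not (-2*M < nd < M and -2*M < ne < M):
            return False
        if (nd * 2**N - (u*d + v*e)) % M or (ne * 2**N - (q*d + r*e)) % M:
            return False
    return True
```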
+
+One last optimization: we can avoid the *md&thinsp;M* and *me&thinsp;M* multiplications in the bottom bits of *cd*
+and *ce* by moving them to the *md* and *me* correction:
+
+```python
+    ...
+    # Compute bottom N bits of t*[d,e].
+    cd, ce = (u*d + v*e) % 2**N, (q*d + r*e) % 2**N
+    # Correct md and me such that the bottom N bits of t*[d,e]+M*[md,me] are zero.
+    # Note that this is not the same as {md = (-Mi * cd) % 2**N} etc. That would also result in N
+    # zero bottom bits, but isn't guaranteed to be a reduction of [0,2^N) compared to the
+    # previous md and me values, and thus would violate our bounds analysis.
+    md -= (Mi*cd + md) % 2**N
+    me -= (Mi*ce + me) % 2**N
+    ...
+```
+
+The resulting function takes *d* and *e* in range *(-2M,M)* as inputs, and outputs values in the same
+range. That also means that the *d* value at the end of `modinv` will be in that range, while we want
+a result in *[0,M)*. To do that, we need a normalization function. It's easy to integrate the
+conditional negation of *d* (based on the sign of *f*) into it as well:
+
+```python
+def normalize(sign, v, M):
+    """Compute sign*v mod M, where v is in range (-2*M,M); output in [0,M)."""
+    assert sign == 1 or sign == -1
+    # v in (-2*M,M)
+    if v < 0:
+        v += M
+    # v in (-M,M). Now multiply v with sign (which can only be 1 or -1).
+    if sign == -1:
+        v = -v
+    # v in (-M,M)
+    if v < 0:
+        v += M
+    # v in [0,M)
+    return v
+```
+
+And calling it in `modinv` is simply:
+
+```python
+   ...
+   return normalize(f, d, M)
+```
+
+
+## 5. Constant-time operation
+
+The primary selling point of the algorithm is fast constant-time operation. What code flow still
+depends on the input data so far?
+
+- the number of iterations of the while *g &ne; 0* loop in `modinv`
+- the branches inside `divsteps_n_matrix`
+- the sign checks in `update_de`
+- the sign checks in `normalize`
+
+To make the while loop in `modinv` constant time it can be replaced with a constant number of
+iterations. The paper proves (Theorem 11.2) that *741* divsteps are sufficient for any *256*-bit
+inputs, and [safegcd-bounds](https://github.com/sipa/safegcd-bounds) shows that the slightly better bound *724* also
+suffices. Given that every loop iteration performs *N* divsteps, it will run a total of
+*&lceil;724/N&rceil;* times.
+
+To deal with the branches in `divsteps_n_matrix` we will replace them with constant-time bitwise
+operations (and hope the C compiler isn't smart enough to turn them back into branches; see
+`ctime_tests.c` for automated tests that this isn't the case). To do so, observe that a
+divstep can be written instead as (compare to the inner loop of `gcd` in section 1):
+
+```python
+    x = -f if delta > 0 else f         # set x equal to (input) -f or f
+    if g & 1:
+        g += x                         # set g to (input) g-f or g+f
+        if delta > 0:
+            delta = -delta
+            f += g                     # set f to (input) g (note that g was set to g-f before)
+    delta += 1
+    g >>= 1
+```
+
+To convert the above to bitwise operations, we rely on a trick to negate conditionally: per the
+definition of negative numbers in two's complement, (*-v == ~v + 1*) holds for every number *v*. As
+*-1* in two's complement is all *1* bits, bitflipping can be expressed as xor with *-1*. It follows
+that *-v == (v ^ -1) - (-1)*. Thus, if we have a variable *c* that takes on values *0* or *-1*, then
+*(v ^ c) - c* is *v* if *c=0* and *-v* if *c=-1*.
+
+Using this we can write:
+
+```python
+    x = -f if delta > 0 else f
+```
+
+in constant-time form as:
+
+```python
+    c1 = (-delta) >> 63
+    # Conditionally negate f based on c1:
+    x = (f ^ c1) - c1
+```
+
+To use that trick, we need a helper mask variable *c1* that resolves the condition *&delta;>0* to *-1*
+(if true) or *0* (if false). We compute *c1* using right shifting, which is equivalent to dividing by
+the specified power of *2* and rounding down (in Python, and also in C under the assumption of a typical two's complement system; see
+`assumptions.h` for tests that this is the case). Right shifting by *63* thus maps all
+numbers in range *[-2<sup>63</sup>,0)* to *-1*, and numbers in range *[0,2<sup>63</sup>)* to *0*.
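+
+Both tricks are easy to try in isolation. In the sketch below, the function names `mask_if_positive` and `cond_negate` are made up for illustration. Python integers have arbitrary precision, so `>> 63` yields the intended mask only for inputs in *[-2<sup>63</sup>,2<sup>63</sup>)*, which mirrors the 64-bit signed case.
+
```python
def mask_if_positive(delta):
    """Return -1 (all one bits) if delta > 0 and 0 otherwise, for |delta| < 2**63."""
    return (-delta) >> 63

def cond_negate(v, c):
    """Return -v if c == -1 and v if c == 0, without branching on c."""
    return (v ^ c) - c
```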
+
+Using the facts that *x&0=0* and *x&(-1)=x* (on two's complement systems again), we can write:
+
+```python
+    if g & 1:
+        g += x
+```
+
+as:
+
+```python
+    # Compute c2=0 if g is even and c2=-1 if g is odd.
+    c2 = -(g & 1)
+    # This masks out x if g is even, and leaves x be if g is odd.
+    g += x & c2
+```
+
+Using the conditional negation trick again we can write:
+
+```python
+    if g & 1:
+        if delta > 0:
+            delta = -delta
+```
+
+as:
+
+```python
+    # Compute c3=-1 if g is odd and delta>0, and 0 otherwise.
+    c3 = c1 & c2
+    # Conditionally negate delta based on c3:
+    delta = (delta ^ c3) - c3
+```
+
+Finally:
+
+```python
+    if g & 1:
+        if delta > 0:
+            f += g
+```
+
+becomes:
+
+```python
+    f += g & c3
+```
+
+It turns out that this can be implemented more efficiently by applying the substitution
+*&eta;=-&delta;*. In this representation, negating *&delta;* corresponds to negating *&eta;*, and incrementing
+*&delta;* corresponds to decrementing *&eta;*. This allows us to remove the negation in the *c1*
+computation:
+
+```python
+    # Compute a mask c1 for eta < 0, and compute the conditional negation x of f:
+    c1 = eta >> 63
+    x = (f ^ c1) - c1
+    # Compute a mask c2 for odd g, and conditionally add x to g:
+    c2 = -(g & 1)
+    g += x & c2
+    # Compute a mask c3 for (eta < 0) and odd (input) g, and use it to conditionally negate eta,
+    # and add g to f:
+    c3 = c1 & c2
+    eta = (eta ^ c3) - c3
+    f += g & c3
+    # Incrementing delta corresponds to decrementing eta.
+    eta -= 1
+    g >>= 1
+```
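+
+As a sanity check, the mask-based divstep can be compared with the branching version from section 1 on random inputs, using *&eta;=-&delta;*. This is a test sketch only; the function names are hypothetical, and Python's unbounded integers stand in for the fixed-width types of the C code.
+
```python
import random

def divstep_branching(delta, f, g):
    """One divstep, written with branches as in section 1."""
    if delta > 0 and g & 1:
        return 1 - delta, g, (g - f) // 2
    elif g & 1:
        return 1 + delta, f, (g + f) // 2
    else:
        return 1 + delta, f, g // 2

def divstep_masked(eta, f, g):
    """One divstep on (eta, f, g) with eta = -delta, using masks instead of branches."""
    c1 = eta >> 63           # -1 if eta < 0, else 0 (for |eta| < 2**63)
    x = (f ^ c1) - c1        # -f if eta < 0, else f
    c2 = -(g & 1)            # -1 if g is odd, else 0
    g += x & c2
    c3 = c1 & c2             # -1 if eta < 0 and (input) g is odd
    eta = (eta ^ c3) - c3
    f += g & c3
    eta -= 1
    g >>= 1
    return eta, f, g

def check_equivalence(trials):
    """Compare both versions on random 256-bit inputs (f odd) and small delta."""
    for _ in range(trials):
        delta = random.randrange(-100, 101)
        f = random.randrange(-2**256, 2**256) | 1
        g = random.randrange(-2**256, 2**256)
        new_delta, f1, g1 = divstep_branching(delta, f, g)
        if (-new_delta, f1, g1) != divstep_masked(-delta, f, g):
            return False
    return True
```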
+
+A variant of divsteps with better worst-case performance can be used instead: starting *&delta;* at
+*1/2* instead of *1*. This reduces the worst case number of iterations to *590* for *256*-bit inputs
+(which can be shown using convex hull analysis). In this case, the substitution *&zeta;=-(&delta;+1/2)*
+is used instead to keep the variable integral. Incrementing *&delta;* by *1* still translates to
+decrementing *&zeta;* by *1*, but negating *&delta;* now corresponds to going from *&zeta;* to *-(&zeta;+1)*, or
+*~&zeta;*. Doing that conditionally based on *c3* is simply:
+
+```python
+    ...
+    c3 = c1 & c2
+    zeta ^= c3
+    ...
+```
+
+By replacing the loop in `divsteps_n_matrix` with a variant of the divstep code above (extended to
+also apply all *f* operations to *u*, *v* and all *g* operations to *q*, *r*), a constant-time version of
+`divsteps_n_matrix` is obtained. The full code will be in section 7.
+
+These bit fiddling tricks can also be used to make the conditional negations and additions in
+`update_de` and `normalize` constant-time.
+
+
+## 6. Variable-time optimizations
+
+In section 5, we modified the `divsteps_n_matrix` function (and a few others) to be constant time.
+Constant time operations are only necessary when computing modular inverses of secret data. In
+other cases, it slows down calculations unnecessarily. In this section, we will construct a
+faster non-constant time `divsteps_n_matrix` function.
+
+To do so, first consider yet another way of writing the inner loop of divstep operations in
+`gcd` from section 1. This decomposition is also explained in the paper in section 8.2. We use
+the original version with initial *&delta;=1* and *&eta;=-&delta;* here.
+
+```python
+for _ in range(N):
+    if g & 1 and eta < 0:
+        eta, f, g = -eta, g, -f
+    if g & 1:
+        g += f
+    eta -= 1
+    g >>= 1
+```
+
+Whenever *g* is even, the loop only shifts *g* down and decreases *&eta;*. When *g* ends in multiple zero
+bits, these iterations can be consolidated into one step. This requires counting the bottom zero
+bits efficiently, which is possible on most platforms; it is abstracted here as the function
+`count_trailing_zeros`.
+
+```python
+def count_trailing_zeros(v):
+    """
+    When v is zero, consider all N zero bits as "trailing".
+    For a non-zero value v, find z such that v=(d<<z) for some odd d.
+    """
+    if v == 0:
+        return N
+    else:
+        return (v & -v).bit_length() - 1
+
+i = N # divsteps left to do
+while True:
+    # Get rid of all bottom zeros at once. In the first iteration, g may be odd and the following
+    # lines have no effect (until "if eta < 0").
+    zeros = min(i, count_trailing_zeros(g))
+    eta -= zeros
+    g >>= zeros
+    i -= zeros
+    if i == 0:
+        break
+    # We know g is odd now
+    if eta < 0:
+        eta, f, g = -eta, g, -f
+    g += f
+    # g is even now, and the eta decrement and g shift will happen in the next loop.
+```
+
+We can now remove multiple bottom *0* bits from *g* at once, but still need a full iteration whenever
+there is a bottom *1* bit. In what follows, we will get rid of multiple *1* bits simultaneously as
+well.
+
+Observe that as long as *&eta; &geq; 0*, the loop does not modify *f*. Instead, it cancels out bottom
+bits of *g* and shifts them out, and decreases *&eta;* and *i* accordingly - interrupting only when *&eta;*
+becomes negative, or when *i* reaches *0*. Combined, this is equivalent to adding a multiple of *f* to
+*g* to cancel out multiple bottom bits, and then shifting them out.
+
+It is easy to find what that multiple is: we want a number *w* such that *g+w&thinsp;f* has a few bottom
+zero bits. If that number of bits is *L*, we want *g+w&thinsp;f mod 2<sup>L</sup> = 0*, or *w = -g/f mod 2<sup>L</sup>*. Since *f*
+is odd, such a *w* exists for any *L*. *L* cannot be more than *i* steps (as we'd finish the loop before
+doing more) or more than *&eta;+1* steps (as we'd run `eta, f, g = -eta, g, -f` at that point), but
+apart from that, we're only limited by the complexity of computing *w*.
+
+This code demonstrates how to cancel up to 4 bits per step:
+
+```python
+NEGINV16 = [15, 5, 3, 9, 7, 13, 11, 1] # NEGINV16[n//2] = (-n)^-1 mod 16, for odd n
+i = N
+while True:
+    zeros = min(i, count_trailing_zeros(g))
+    eta -= zeros
+    g >>= zeros
+    i -= zeros
+    if i == 0:
+        break
+    # We know g is odd now
+    if eta < 0:
+        eta, f, g = -eta, g, -f
+    # Compute limit on number of bits to cancel
+    limit = min(min(eta + 1, i), 4)
+    # Compute w = -g/f mod 2**limit, using the table value for -1/f mod 2**4. Note that f is
+    # always odd, so its inverse modulo a power of two always exists.
+    w = (g * NEGINV16[(f & 15) // 2]) % (2**limit)
+    # As w = -g/f mod 2**limit, we have g + w*f = 0 mod 2**limit.
+    g += w * f
+    assert g % (2**limit) == 0
+    # The next iteration will now shift out at least limit bottom zero bits from g.
+```
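+
+The equivalence claimed above can be checked empirically. The following is a minimal sketch, not
+part of the implementation: the names `divsteps_plain`, `divsteps_fast` and `check_equivalence`,
+the batch size `N = 62`, and the random input ranges are assumptions made for illustration. It
+runs the bit-at-a-time loop and the 4-bit cancelling loop on the same inputs and compares results.
+
+```python
+import random
+
+N = 62  # assumed number of divsteps per batch
+NEGINV16 = [15, 5, 3, 9, 7, 13, 11, 1] # NEGINV16[n//2] = (-n)^-1 mod 16, for odd n
+
+def count_trailing_zeros(v):
+    """Number of trailing zero bits of v; N when v is zero."""
+    return N if v == 0 else (v & -v).bit_length() - 1
+
+def divsteps_plain(eta, f, g):
+    """Perform N divsteps, one bit of g at a time."""
+    for _ in range(N):
+        if g & 1 and eta < 0:
+            eta, f, g = -eta, g, -f
+        if g & 1:
+            g += f
+        eta -= 1
+        g >>= 1
+    return eta, f, g
+
+def divsteps_fast(eta, f, g):
+    """Perform N divsteps, shifting out runs of zeros and cancelling up to 4 bits at once."""
+    i = N
+    while True:
+        zeros = min(i, count_trailing_zeros(g))
+        eta, g, i = eta - zeros, g >> zeros, i - zeros
+        if i == 0:
+            break
+        if eta < 0:
+            eta, f, g = -eta, g, -f
+        limit = min(eta + 1, i, 4)
+        w = (g * NEGINV16[(f & 15) // 2]) % (2**limit)
+        g += w * f
+    return eta, f, g
+
+def check_equivalence(trials=1000, seed=1):
+    """Compare both loops on random odd f and arbitrary g (both possibly negative)."""
+    rng = random.Random(seed)
+    for _ in range(trials):
+        f = (rng.getrandbits(256) | 1) * rng.choice((-1, 1))
+        g = rng.getrandbits(256) * rng.choice((-1, 1))
+        eta = rng.randrange(-5, 6)
+        if divsteps_plain(eta, f, g) != divsteps_fast(eta, f, g):
+            return False
+    return True
+```
+
+Calling `check_equivalence()` returns `True` when both loops agree on every sampled input.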
+
+By using a bigger table more bits can be cancelled at once. The table can also be implemented
+as a formula. Several formulas are known for computing modular inverses modulo powers of two;
+some can be found in Hacker's Delight second edition by Henry S. Warren, Jr. pages 245-247.
+Here we need the negated modular inverse, which is a simple transformation of those:
+
+- Instead of a 3-bit table:
+  - *-f* or *f ^ 6*
+- Instead of a 4-bit table:
+  - *1 - f(f + 1)*
+  - *-(f + (((f + 1) & 4) << 1))*
+- For larger tables the following technique can be used: if *w=-1/f mod 2<sup>L</sup>*, then *w(w&thinsp;f+2)* is
+  *-1/f mod 2<sup>2L</sup>*. This allows extending the previous formulas (or tables). In particular we
+  have this 6-bit function (based on the 3-bit function above):
+  - *f(f<sup>2</sup> - 2)*
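+
+The formulas in this list can be checked by brute force. The sketch below is only an illustration:
+the helper names `neginv_candidates`, `is_negated_inverse` and `check_formulas`, and the tested
+range of *f*, are assumptions. It confirms that each expression *w* satisfies
+*w&thinsp;f + 1 = 0 mod 2<sup>bits</sup>* for odd *f*, including two values obtained with the
+doubling technique.
+
+```python
+def neginv_candidates(f):
+    """Return (bits, w) pairs, where w is claimed to equal -1/f mod 2**bits for odd f."""
+    w3 = -f                # 3-bit formula
+    w4 = 1 - f * (f + 1)   # 4-bit formula
+    return [
+        (3, w3),
+        (3, f ^ 6),
+        (4, w4),
+        (4, -(f + (((f + 1) & 4) << 1))),
+        (6, f * (f * f - 2)),
+        (6, w3 * (w3 * f + 2)),  # doubling step applied to the 3-bit value
+        (8, w4 * (w4 * f + 2)),  # doubling step applied to the 4-bit value
+    ]
+
+def is_negated_inverse(w, f, bits):
+    """True if w = -1/f mod 2**bits, i.e. w*f + 1 is divisible by 2**bits."""
+    return (w * f + 1) % (1 << bits) == 0
+
+def check_formulas(limit=1 << 10):
+    """Check every candidate for all odd f with |f| < limit."""
+    return all(is_negated_inverse(w, f, bits)
+               for f in range(-limit + 1, limit, 2)
+               for bits, w in neginv_candidates(f))
+```
+
+The results are not reduced here; a caller would mask them to the number of bits it needs, as the
+table-based loop does with `% (2**limit)`.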
+
+This loop, again extended to also handle *u*, *v*, *q*, and *r* alongside *f* and *g*, placed in
+`divsteps_n_matrix`, gives a significantly faster, but non-constant time version.
+
+
+## 7. Final Python version
+
+All together we need the following functions:
+
+- A way to compute the transition matrix in constant time, using the `divsteps_n_matrix` function
+  from section 2, but with its loop replaced by a variant of the constant-time divstep from
+  section 5, extended to handle *u*, *v*, *q*, *r*:
+
+```python
+def divsteps_n_matrix(zeta, f, g):
+    """Compute zeta and transition matrix t after N divsteps (multiplied by 2^N)."""
+    u, v, q, r = 1, 0, 0, 1 # start with identity matrix
+    for _ in range(N):
+        c1 = zeta >> 63
+        # Compute x, y, z as conditionally-negated versions of f, u, v.
+        x, y, z = (f ^ c1) - c1, (u ^ c1) - c1, (v ^ c1) - c1
+        c2 = -(g & 1)
+        # Conditionally add x, y, z to g, q, r.
+        g, q, r = g + (x & c2), q + (y & c2), r + (z & c2)
+        c1 &= c2                     # reusing c1 here for the earlier c3 variable
+        zeta = (zeta ^ c1) - 1       # inlining the unconditional zeta decrement here
+        # Conditionally add g, q, r to f, u, v.
+        f, u, v = f + (g & c1), u + (q & c1), v + (r & c1)
+        # When shifting g down, don't shift q, r, as we construct a transition matrix multiplied
+        # by 2^N. Instead, shift f's coefficients u and v up.
+        g, u, v = g >> 1, u << 1, v << 1
+    return zeta, (u, v, q, r)
+```
+
+- The functions to update *f* and *g*, and *d* and *e*, from section 2 and section 4, with the constant-time
+  changes to `update_de` from section 5:
+
+```python
+def update_fg(f, g, t):
+    """Multiply matrix t/2^N with [f, g]."""
+    u, v, q, r = t
+    cf, cg = u*f + v*g, q*f + r*g
+    return cf >> N, cg >> N
+
+def update_de(d, e, t, M, Mi):
+    """Multiply matrix t/2^N with [d, e], modulo M."""
+    u, v, q, r = t
+    d_sign, e_sign = d >> 257, e >> 257
+    md, me = (u & d_sign) + (v & e_sign), (q & d_sign) + (r & e_sign)
+    cd, ce = (u*d + v*e) % 2**N, (q*d + r*e) % 2**N
+    md -= (Mi*cd + md) % 2**N
+    me -= (Mi*ce + me) % 2**N
+    cd, ce = u*d + v*e + M*md, q*d + r*e + M*me
+    return cd >> N, ce >> N
+```
+
+- The `normalize` function from section 4, made constant time as well:
+
+```python
+def normalize(sign, v, M):
+    """Compute sign*v mod M, where v in (-2*M,M); output in [0,M)."""
+    v_sign = v >> 257
+    # Conditionally add M to v.
+    v += M & v_sign
+    c = (sign - 1) >> 1
+    # Conditionally negate v.
+    v = (v ^ c) - c
+    v_sign = v >> 257
+    # Conditionally add M to v again.
+    v += M & v_sign
+    return v
+```
+
+- And finally the `modinv` function too, adapted to use *&zeta;* instead of *&delta;*, and using the fixed
+  iteration count from section 5:
+
+```python
+def modinv(M, Mi, x):
+    """Compute the modular inverse of x mod M, given Mi=1/M mod 2^N."""
+    zeta, f, g, d, e = -1, M, x, 0, 1
+    for _ in range((590 + N - 1) // N):
+        zeta, t = divsteps_n_matrix(zeta, f % 2**N, g % 2**N)
+        f, g = update_fg(f, g, t)
+        d, e = update_de(d, e, t, M, Mi)
+    return normalize(f, d, M)
+```
+
+- To get a variable time version, replace the `divsteps_n_matrix` function with one that uses the
+  divsteps loop from section 6, and a `modinv` version that calls it without the fixed iteration
+  count:
+
+```python
+NEGINV16 = [15, 5, 3, 9, 7, 13, 11, 1] # NEGINV16[n//2] = (-n)^-1 mod 16, for odd n
+def divsteps_n_matrix_var(eta, f, g):
+    """Compute eta and transition matrix t after N divsteps (multiplied by 2^N)."""
+    u, v, q, r = 1, 0, 0, 1
+    i = N
+    while True:
+        zeros = min(i, count_trailing_zeros(g))
+        eta, i = eta - zeros, i - zeros
+        g, u, v = g >> zeros, u << zeros, v << zeros
+        if i == 0:
+            break
+        if eta < 0:
+            eta, f, u, v, g, q, r = -eta, g, q, r, -f, -u, -v
+        limit = min(min(eta + 1, i), 4)
+        w = (g * NEGINV16[(f & 15) // 2]) % (2**limit)
+        g, q, r = g + w*f, q + w*u, r + w*v
+    return eta, (u, v, q, r)
+
+def modinv_var(M, Mi, x):
+    """Compute the modular inverse of x mod M, given Mi = 1/M mod 2^N."""
+    eta, f, g, d, e = -1, M, x, 0, 1
+    while g != 0:
+        eta, t = divsteps_n_matrix_var(eta, f % 2**N, g % 2**N)
+        f, g = update_fg(f, g, t)
+        d, e = update_de(d, e, t, M, Mi)
+    return normalize(f, d, M)
+```
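+
+As a hedged illustration of how these functions might be exercised, the sketch below prepares
+inputs for the secp256k1 field size. The choice `N = 62`, the sample value `x`, and the use of
+Python's built-in `pow` (version 3.8 or later) as an independent reference are assumptions, not
+part of the implementation.
+
+```python
+N = 62                       # assumed number of divsteps per batch
+M = 2**256 - 2**32 - 977     # the secp256k1 field prime
+Mi = pow(M, -1, 2**N)        # Mi = 1/M mod 2^N, the precomputed input of modinv
+x = 0x1234567890ABCDEF       # arbitrary nonzero sample input
+expected = pow(x, -1, M)     # reference inverse to compare results against
+
+# Sanity checks on the prepared values.
+assert (M * Mi) % 2**N == 1
+assert (x * expected) % M == 1
+```
+
+With the functions above in scope, `modinv(M, Mi, x)` is expected to return `expected`; the same
+value can serve as a reference for the variable-time variant.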
+
+## 8. From GCDs to Jacobi symbol
+
+We can also use a similar approach to calculate the Jacobi symbol *(x | M)* by keeping track of an
+extra variable *j*, for which at every step *(x | M) = j (g | f)*. As we update *f* and *g*, we
+make corresponding updates to *j* using
+[properties of the Jacobi symbol](https://en.wikipedia.org/wiki/Jacobi_symbol#Properties):
+* *((g/2) | f)* is either *(g | f)* or *-(g | f)*, depending on the value of *f mod 8* (negating if it's *3* or *5*).
+* *(f | g)* is either *(g | f)* or *-(g | f)*, depending on *f mod 4* and *g mod 4* (negating if both are *3*).
+
+These updates depend only on the values of *f* and *g* modulo *4* or *8*, and can thus be applied
+very quickly, as long as we keep track of a few additional bits of *f* and *g*. Overall, this
+calculation is slightly simpler than the one for the modular inverse because we no longer need to
+keep track of *d* and *e*.
+
+However, one difficulty of this approach is that the Jacobi symbol *(a | n)* is only defined for
+positive odd integers *n*, whereas in the original safegcd algorithm, *f, g* can take negative
+values. We resolve this by using the following modified steps:
+
+```python
+        # Before
+        if delta > 0 and g & 1:
+            delta, f, g = 1 - delta, g, (g - f) // 2
+
+        # After
+        if delta > 0 and g & 1:
+            delta, f, g = 1 - delta, g, (g + f) // 2
+```
+
+The algorithm is still correct, since the changed divstep, called a "posdivstep" (see section 8.4
+and E.5 in the paper) preserves *gcd(f, g)*. However, there's no proof that the modified algorithm
+will converge. The justification for posdivsteps is completely empirical: in practice, it appears
+that the vast majority of nonzero inputs converge to *f=g=gcd(f<sub>0</sub>, g<sub>0</sub>)* in a
+number of steps proportional to their logarithm.
+
+Note that:
+- We require inputs to satisfy *gcd(x, M) = 1*, as otherwise *f=1* is not reached.
+- We require inputs *x &ne; 0*, because applying posdivstep with *g=0* has no effect.
+- We need to update the termination condition from *g=0* to *f=1*.
+
+We account for the possibility of nonconvergence by only performing a bounded number of
+posdivsteps, and then falling back to a square-root-based Jacobi calculation if a solution has not
+yet been found.
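+
+The steps above can be sketched in Python as follows. This is an unbatched, unoptimized
+illustration and not the C implementation: the function name `jacobi_posdivsteps`, the parameter
+`max_steps` with its default, the up-front `gcd` check that returns *0* for non-coprime inputs,
+and the `None` return value on nonconvergence are all assumptions made for this example.
+
+```python
+from math import gcd
+
+def jacobi_posdivsteps(x, M, max_steps=4096):
+    """Jacobi symbol (x | M) for odd M > 0, or None if max_steps posdivsteps were not enough."""
+    assert M > 0 and M & 1
+    x %= M
+    if gcd(x, M) != 1:
+        return 0  # covers x = 0 for M > 1; f = 1 would never be reached for these inputs
+    delta, f, g, j = 1, M, x, 1  # invariant: (x | M) = j * (g | f), with f odd and f, g > 0
+    for _ in range(max_steps):
+        if f == 1:
+            return j  # (g | 1) = 1
+        if g & 1:
+            if delta > 0:
+                # Swap f and g: (g | f) = -(f | g) if both are 3 mod 4, else (f | g).
+                if (f & 3) == 3 and (g & 3) == 3:
+                    j = -j
+                delta, f, g = -delta, g, f
+            # Adding f does not change the symbol: (g | f) = (g + f | f).
+            g += f
+        # Halving g: (g | f) = (2 | f) * (g/2 | f), where (2 | f) = -1 iff f mod 8 is 3 or 5.
+        if (f & 7) in (3, 5):
+            j = -j
+        delta, g = 1 + delta, g >> 1
+    return j if f == 1 else None
+```
+
+A return value of `None` signals that the step bound was hit, at which point a caller would fall
+back to another method as described above.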
+
+The optimizations in sections 3-7 above are described in the context of the original divsteps, but
+in the C implementation we also adapt most of them (not including "avoiding modulus operations",
+since it's not necessary to track *d, e*, and "constant-time operation", since we never calculate
+Jacobi symbols for secret data) to the posdivsteps version.