# Optimal Metastability-Containing Sorting via Parallel Prefix Computation\*

Johannes Bund, Christoph Lenzen, Moti Medina

Abstract—Friedrichs et al. (TC 2018) showed that metastability can be contained when sorting inputs arising from time-to-digital converters, i.e., measurement values can be correctly sorted without resolving metastability using synchronizers first. However, this work left open whether this can be done by small circuits. We show that this is indeed possible, by providing a circuit that sorts Gray code inputs (possibly containing a metastable bit) and has asymptotically optimal depth and size.

Our solution utilizes the parallel prefix computation (PPC) framework (JACM 1980). We improve this construction by bounding its fan-out by an arbitrary  $f \ge 3$ , without affecting depth and increasing circuit size by a small constant factor only. Thus, we obtain the first PPC circuits with asymptotically optimal size, constant fan-out, and optimal depth.

To show that applying the PPC framework to the sorting task is feasible, we prove that the latter can, despite potential metastability, be decomposed such that the core operation is associative. We obtain asymptotically optimal metastability-containing sorting networks. We complement these results with simulations, independently verifying the correctness as well as small size and delay of our circuits.

### INTRODUCTION

Metastability is a fundamental obstacle when crossing clock domains, potentially resulting in soft errors with critical consequences [14]. As it has been shown that metastability cannot be avoided deterministically [25], synchronizers [19] are employed to reduce the error probability to tolerable levels. This approach trades precious time for reliability: the more time is allocated for metastability resolution, the smaller the probability of metastability-induced faults.

Recently, a different approach has been proposed, coined metastability-containing (MC) circuits [10]. It accepts a limited amount of metastability in the input to a digital circuit and ensures limited metastability of its output, so that the result is still useful. In a series of works [24], [3], [4], we applied this approach to a fundamental primitive: sorting. The circuit given in [4] is asymptotically optimal in depth and size.

Our Contribution: In this article, we present the machinery used to obtain the circuit from [4] in detail. We prove that CMOS implementations of basic gates realize Kleene logic (cf. [20, §64]), justifying the computational model introduced in [10] and used in this article.

The task of sorting an arbitrary number of inputs can be reduced to sorting two inputs by using sorting networks [21]. The 0-1-principle (cf. Section 2) shows that plugging an MC 2-sort(B) circuit (for B-bit inputs) into a sorting network (for n values) readily yields an MC circuit that is capable of sorting n inputs. Hence, we need to design a 2-sort(B) circuit sorting two inputs in an MC way.

As the choice of the encoding matters a lot for MC circuits, we characterize the set of input strings we want to

\*This article generalizes and extends work presented at DATE 2018 [4].

sort ("valid strings"). A valid string is either a (standard) Gray code string or a string obtained from a Gray code string by replacing the unique bit that would change on the up-count to the "next" codeword by M for metastability (the third logic value in Kleene logic). When using nonredundant codes, the use of Gray codes is mandatory: when converting an analog value to a digital one, continuously changing the input can force any circuit (that uses the value in a non-trivial way) into metastability [25]. Moreover, for combinational circuits in the abstraction of Kleene logic, all output bits that change when flipping a given input bit must become unstable when the input bit is unstable, cf. [10]. For instance, encoding a value unknown to be 11 or 12 in standard binary code would result in a string that, once metastability has been resolved, may represent any number in the interval from 8 to 15, cf. Section 3.

Valid strings arise naturally when stopping a Gray code counter asynchronously [12] or, more generally, whenever performing analog-to-digital conversion; respective circuits may risk multiple metastable bits to achieve better averagecase precision, but for the best worst-case precision one can stick to guaranteeing valid strings as output. Exploiting the structure of Gray code and the restriction to valid strings, we show how to reliably sort all inputs despite the uncertainty about the represented value arising from metastability.

We formally specify the 2-sort(B) circuit and then prove that the task of comparing two valid strings can be decomposed into first performing a four-valued comparison on each prefix pair of the two valid input strings, and then inferring the corresponding output bits. This reduces the design of 2-sort(B) to a parallel prefix computation (PPC) problem, which for our purposes can be phrased as follows.

**Definition 1.1** (PPC $_{\oplus}(B)$ ). For associative  $\oplus$ :  $D \times D \to D$ and  $B \in \mathbb{N}$ , a  $PPC_{\oplus}(B)$  circuit is specified as follows.

Input:  $d \in D^B$ Output:  $\pi \in D^B$ ,

Functionality:  $\pi_i = \bigoplus_{i=1}^i d_i$  for all  $i \in [1, B]$ .

Manuscript received Mmmm DD, YYYY; revised Mmmm DD, YYYY.

Johannes Bund and Christoph Lenzen are affiliated with the Max Planck Institute for Informatics, Saarland Informatics Campus, Germany, Email: { jbund, clenzen}@mpi-inf.mpg.de

Moti Medina is affiliated with the Department of Electrical & Computer Engineering, Ben-Gurion University of the Negev, Beer Sheva, Israel, Email: medinamo@bgu.ac.il

Fast PPC circuits that are simultaneously (asymptotically) optimal in depth and size are known due to a celebrated result by Ladner and Fischer [23]. Going beyond [4], we present the full range of solutions that can be derived using their framework, which allows for a trade-off between depth and size of the 2-sort circuit. Most prominently, optimizing for depth reduces the depth of the circuit by a factor of 2 compared to [4] to optimal  $\lceil \log B \rceil$ , at the expense of increasing the size by a factor of up to 2.

However, relying on the construction from [23] as-is results in a very large fan-out. We present a modification reducing fan-out to any number  $f \geq 3$  without affecting depth, increasing the size by a factor of only  $1 + \mathcal{O}(1/f)$  (plus at most 3B/2 buffers). In particular, our results imply that the depth of an MC sorting circuit can match the delay of a non-containing circuit, while maintaining constant fanout and a constant-factor size overhead. Due to the fact that PPC circuits lie at the heart of fast adders [27], we consider this result of independent interest.

We complement our theoretical findings by simulations confirming the correctness and small size of the devised circuits. Post-layout area and delay of the designed circuits compare favorably with a baseline provided by a straightforward non-containing implementation.

Organization of this Article: We discuss related work in Section 2. Some preliminaries, the computational model and its justification, as well as the problem specification are given in Section 3. Next, in Section 4, we break the task of designing a 2-sort(B) circuit down into comparing prefixes and subsequently generating the output bits out of the computed comparison values and the respective pair of input bits. The comparison can be further decomposed into sequential application of an associative operator, which enables application of the PPC framework to compute all prefixes efficiently in parallel with (asymptotically) optimal depth. In order to keep this article self-contained, we compactly review the PPC framework in Section 5. The section then proceeds to showing how to modify the construction for bounded fan-out and bounding the size of the resulting circuits. In Section 6, we implement the base operators by subcircuits and plug the pieces together to obtain complete circuits. We then simulate them up to an input width of B = 12 to independently verify their correctness, and provide delay and area of the laid out circuits. We compare to a non-containing version as baseline, demonstrating the controlled increase in size of the circuit. We conclude the article in Section 7, where we also briefly discuss followup work that generalizes our results, demonstrating that higher-level concepts of this work like sorting networks and parallel prefix computation are applicable to further MC circuits. Some proofs are sketched or omitted; the article with full proofs is provided as online supplemental material.

#### 2 RELATED WORK

**Sorting Networks**: Sorting networks (see, e.g., [21]) sort n inputs from a totally ordered universe by feeding them into n parallel wires that are connected by 2-sort elements, i.e., subcircuits sorting two inputs; these can act in parallel whenever they do not depend on each other's output. A correct sorting network sorts all possible inputs,

i.e., the wires are labeled 1 to n such that the  $i^{th}$  wire outputs the  $i^{th}$  element of the sorted list of inputs. The size of a sorting network is its number of 2-sort elements and its depth is the maximum number of 2-sort elements an input may pass through until reaching the output.

The 0-1-principle [21] states that a sorting network—assuming the 2-sort circuits are correct—is correct if and only if it sorts 0-1 inputs correctly. Thus, we obtain sorting networks for inputs that may suffer from metastability by constructing 2-sort circuits (w.r.t. a suitable order on such inputs) and plugging them into existing sorting networks.

Sorting networks have been extensively studied. Tight lower bounds of depth  $\Omega(\log n)$  (trivial) and size  $\Omega(n\log n)$  (see, e.g., [8]) are known and can be simultaneously asymptotically matched [1]. More practically, for small values of n optimal depth and/or size networks are known [6], [7], [21]. Accordingly, our task boils down to finding optimal (or close to optimal) metastability-containing 2-sort circuits. For B-bit inputs, our 2-sort circuits have depth and size  $\mathcal{O}(\log B)$  and  $\mathcal{O}(B)$ , respectively, which is (trivially) optimal up to constants; as size and depth of our circuits are close to non-containing 2-sort circuits (cf. Table 12), we conclude that our approach yields MC sorting networks that are optimal up to small constant factors in both depth and size.

**Prior Work on MC Circuits**: Recent work [10] shows that for any Boolean function a combinational MC circuit implementing its *metastable closure* (see Definition 3.8) exists. The metastable closure can be seen as a best effort to contain metastability: when for an input with (some) metastable bits the stable input bits already determine a given output bit of the original Boolean function, the closure attains the respective value on this output bit; otherwise it is metastable.

Unfortunately, the proof from [10], which uses a construction dating back to Huffman [16], yields circuits of exponential size in the number of input bits B. The same is true for speculative computing [28]. Unconditional lower bounds on MC circuits [17] show that this cannot be avoided in general, even if the implemented function admits a small non-containing circuit. The same work provides, assuming that at most k input bits can be metastable, a construction with multiplicative  $B^{\mathcal{O}(k)}$  and additive  $\mathcal{O}(k\log n)$  overheads in size and depth, respectively. For the 2-sort element, k=2 (each Gray code string may contain one metastable bit), but the resulting circuits are still far from optimal.

In [10], an alternative construction relying on non-combinational logic is given, achieving (up to minor-order terms) factor 2k+1 increase in size and additive  $\Theta(\log k)$  increase in depth of the resulting circuit; for a 2-sort circuit, k=2, so these overheads are constant. Rule-of-thumb calculations suggest that optimized versions of the circuits presented here and derived by this method would have comparable performance. A fair and detailed comparison would require fully-fledged designs of both approaches, which is beyond the scope of this article. Note, however, that our design has the advantage of being purely combinational.

**Parallel Prefix Computation**: Ladner and Fischer [23] studied the parallel application of an associative operator to all prefixes of an input string of length  $\ell$  (over an arbitrary alphabet). They give parallel prefix computation (PPC) circuits of depth  $\mathcal{O}(\log \ell)$  and size  $\mathcal{O}(\ell)$  (where the circuit implementing the operator is assumed to have size and

depth 1). However, when requiring optimal depth of  $\lceil \log \ell \rceil$ , their corresponding solution suffers from fan-out larger than  $\ell/2$ . An earlier construction by Kogge and Stone [22] simultaneously achieves optimal depth and fan-out of 2. This yields the fastest adder circuits to date (cf. [27]), but at the expense of a large size of  $\ell(\lceil \log \ell \rceil - 1) + 1$ . A number of additional constructions have been developed for adders, including special cases ([2], [26]) of the one by Ladner and Fischer, cf. [31]. However, no other construction achieves asymptotically optimal depth and size.

#### 3 Model and Problem

In this section, we discuss how to model metastability in a worst-case fashion and formally specify the input/output behavior of our circuits. Our model is a simplified version of the one from [10] for combinational circuits (cf. [9, Chap. 7]). This means to represent metastable "bits" by M and extend truth tables as in Kleene's 3-valued logic [20, §64].

**Basic Notation**: We set  $[N]:=\{0,\ldots,N-1\}$  for  $N\in\mathbb{N}$  and  $[i,j]=\{i,i+1,\ldots,j\}$  for  $i,j\in\mathbb{N},$   $i\leq j$ . We denote  $\mathbb{B}:=\{0,1\}$  and  $\mathbb{B}_{\mathbb{M}}:=\{0,1,\mathbb{M}\}$ . For a B-bit string  $g\in\mathbb{B}_{\mathbb{M}}^B$  and  $i\in[1,B]$ , denote by  $g_i$  its i-th bit, i.e.,  $g=g_1g_2\ldots g_B$ . We use the shorthand  $g_{i,j}:=g_i\ldots g_j$ , where  $i,j\in[1,B]$  and  $i\leq j$ . Let  $\operatorname{par}(g)$  denote the parity of  $g\in\mathbb{B}^B$ , i.e,  $\operatorname{par}(g)=\sum_{i=1}^B g_i \bmod 2$ . For a function f and a set f0 we abbreviate f1 and f2.

# 3.1 Binary Reflected Gray Code

A standard binary representation of inputs is unsuitable: uncertainty of the input values may be arbitrarily amplified by the encoding. E.g. representing a value unknown to be 11 or 12, which are encoded as 1011 resp. 1100, would result in the bit string  $1 \mbox{\sc MMM}$ , i.e., a string that is metastable in every position that differs for both strings. However,  $1 \mbox{\sc MMM}$  may represent any number in the interval from 8 to 15, amplifying the initial uncertainty of being in the interval from 11 to 12. An encoding that does not lose precision for consecutive values is Gray code.

We use B-bit binary reflected Gray code,  $rg_B:[N]\to\mathbb{B}^B$ , which is defined recursively. For simplicity (and without loss of generality) we set  $N:=2^B$ . A 1-bit code is given by  $\mathrm{rg}_1(0)=0$  and  $\mathrm{rg}_1(1)=1$ . For B>1, we start with the first bit fixed to 0 and counting with  $\mathrm{rg}_{B-1}(\cdot)$  (for the first  $2^{B-1}$  codewords), then toggle the first bit to 1, and finally "count down"  $\mathrm{rg}_{B-1}(\cdot)$  while fixing the first bit again, cf. Table 1. Formally, this yields for  $x\in[N]$ 

$$\operatorname{rg}_B(x) := \begin{cases} 0 \operatorname{rg}_{B-1}(x) & \text{if } x \in [2^{B-1}] \\ 1 \operatorname{rg}_{B-1}(2^B - 1 - x) & \text{if } x \in [2^B] \setminus [2^{B-1}] \ . \end{cases}$$

As each B-bit string is a codeword, the code is a bijection and the encoding function also defines the decoding function. Denote by  $\langle \cdot \rangle : \mathbb{B}^B \to [N]$  the decoding function of a Gray code string, i.e., for  $x \in [N]$ ,  $\langle \operatorname{rg}_B(x) \rangle = x$ .

For two binary reflected Gray code strings  $g,h\in\mathbb{B}^B$ , we define their maximum and minimum as

$$(\max^{\operatorname{rg}}\{g,h\},\min^{\operatorname{rg}}\{g,h\}) := \begin{cases} (g,h) & \text{if } \langle g \rangle \geq \langle h \rangle \\ (h,g) & \text{if } \langle g \rangle < \langle h \rangle \;. \end{cases}$$

For example:

TABLE 1: 4-bit binary reflected Gray code

| # | $g_1, g_{2,4}$ | # | $g_1, g_{2,4}$ | #  | $g_1,g_{2,4}$ | #  | $g_1, g_{2,4}$ |
|---|----------------|---|----------------|----|---------------|----|----------------|
| 0 | 0 000          | 4 | 0 110          | 8  | 1 100         | 12 | 1 010          |
| 1 | 0 001          | 5 | 0 111          | 9  | 1 101         | 13 | 1 011          |
| 2 | 0.011          | 6 | 0 101          | 10 | $1\ 1111$     | 14 | 1 001          |
| 3 | 0 010          | 7 | 0 100          | 11 | 1 110         | 15 | 1 000          |

TABLE 2: 4-bit valid inputs

| g    | $ \langle g \rangle $ | g    | $\langle g \rangle$ | g    | $\langle g \rangle$ | g    | $\langle g \rangle$ |
|------|-----------------------|------|---------------------|------|---------------------|------|---------------------|
| 0000 | 0                     | 0110 | 4                   | 1100 | 8                   | 1010 | 12                  |
| M000 | -                     | 011M | _                   | 110M | _                   | 101M | _                   |
| 0001 | 1                     | 0111 | 5                   | 1101 | 9                   | 1011 | 13                  |
| 00M1 | -                     | 01M1 | _                   | 11M1 | _                   | 10M1 | _                   |
| 0011 | 2                     | 0101 | 6                   | 1111 | 10                  | 1001 | 14                  |
| 001M | -                     | 010M | _                   | 111M | _                   | 100M | _                   |
| 0010 | 3                     | 0100 | 7                   | 1110 | 11                  | 1000 | 15                  |
| 0M10 | _                     | M100 | _                   | 1M10 | _                   | _    | _                   |

- $\max^{\text{rg}} \{0011, 0100\} = \max^{\text{rg}} \{\text{rg}_B(2), \text{rg}_B(7)\} = 0100$ ,
- $\min^{\text{rg}} \{0111, 0101\} = \min^{\text{rg}} \{ \operatorname{rg}_B(9), \operatorname{rg}_B(10) \} = 0111.$

# 3.2 Valid Strings

The inputs to the sorting circuit may have some metastable bits, which means that the respective signals behave out-of-spec from the perspective of Boolean logic. Such inputs, referred to as *valid strings*, are introduced with the help of the following operator.

**Definition 3.1** (\* Operator). For  $B \in \mathbb{N}$ , define the operator  $*: \mathbb{B}^B_{\mathbb{M}} \times \mathbb{B}^B_{\mathbb{M}} \to \mathbb{B}^B_{\mathbb{M}}$  by

$$\forall i \in \{1, \dots, B\} : (x * y)_i := \begin{cases} x_i & \text{if } x_i = y_i \\ \mathbb{M} & \text{else.} \end{cases}$$

**Observation 3.2.** The operator \* is associative and commutative. Hence, for a set  $S = \{x^{(1)}, \dots, x^{(k)}\}$  of B-bit strings, we can use the shorthand  $*S := *_{x \in S} x := x^{(1)} *_x^{(2)} *_{\dots} *_x^{(k)}$ . We call \*S the superposition of the strings in S.

Valid strings have at most one metastable bit. If this bit resolves to either 0 or 1, the resulting string encodes either x or x+1 for some x, cf. Table 2.

**Definition 3.3** (Valid Strings). Let  $B \in \mathbb{N}$  and  $N = 2^B$ . Then, the set of valid strings of length B is

$$\mathcal{S}_{\mathrm{rg}}^B := \mathrm{rg}_B([N]) \cup \bigcup_{x \in [N-1]} \{ \mathrm{rg}_B(x) * \mathrm{rg}_B(x+1) \} .$$

#### 3.3 Resolution and Closure

To extend the specification of max<sup>rg</sup> and min<sup>rg</sup> to valid strings, we make use of the *metastable closure* [10]. The metastable closure is defined over the possible *resolutions* of metastable bits.

**Definition 3.4** (Resolution [10]). For  $x \in \mathbb{B}^B_{\mathsf{M}}$ , define the resolution  $\operatorname{res}(x) : \mathbb{B}^B_{\mathsf{M}} \to \mathcal{P}\left(\mathbb{B}^B\right)$  as follows:

$$res(x) := \{ y \in \mathbb{B}^B \mid \forall i \in \{1, \dots, B\} \colon x_i \neq \mathbb{M} \Rightarrow y_i = x_i \} .$$

Thus, res(x) is the set of all strings obtained by replacing all Ms in x by either 0 or 1: M acts as a "wild card." For any x and y, we have that res(xy) = res(x) res(y).

We note two observations for later use.

**Observation 3.5.** For any  $x \in \mathbb{B}_{M}^{B}$ ,  $* \operatorname{res}(x) = x$ .

*Proof.* Let  $x \in \mathbb{B}_{\mathbb{M}}^B$  and let I be the set of indices where x is stable, i.e.,  $i \in I$  iff  $x_i \neq M$ . From Definition 3.4, we get that

$$\forall i \in \{1 \dots B\} : i \in I \Leftrightarrow \{x_i\} = \operatorname{res}(x_i) .$$

By Definition 3.1 and Observation 3.2,

$$\forall i \in \{1, \dots, B\} : \{(*\operatorname{res}(x))_i\} = \begin{cases} \operatorname{res}(x_i) & \text{if } i \in I \\ \{M\} & \text{else.} \end{cases}$$

This entails the claim  $* \operatorname{res}(x) = x$ .

For example:  $* \operatorname{res}(0 M10) = *\{0010, 0110\} = 0 M10$ .

**Observation 3.6.** For  $\emptyset \neq S \subseteq \mathbb{B}^B$ , we have  $S \subseteq res(*S)$ .

*Proof.* Let  $\emptyset \neq S \subseteq \mathbb{B}^B$  and  $s \in S$ . Define I as the set of indices where \*S is stable, i.e.,

$$\forall i \in \{1 \dots B\} : i \in I \Leftrightarrow (*S)_i \neq M.$$

From Definition 3.1 and Observation 3.2, we conclude for  $i \in \{1 \dots B\}$ :

$$(*S)_i = \begin{cases} s_i & \text{if } i \in I \\ M & \text{else.} \end{cases}$$

Since by Definition 3.4 each combination of replacing Ms by 0s and 1s occurs in res(\*S), we conclude that

$$\exists x \in res(*S) : \forall i \notin I : x_i = s_i.$$

Since  $\forall i \in \{1 \dots B\} : i \in I \Leftrightarrow \{x_i\} = \operatorname{res}(x_i), s \in \operatorname{res}(*S).$ This proves the claim  $S \subseteq res(*S)$ .

We observe that in general the reverse direction does not hold, i.e.,  $res(*S) \nsubseteq S$ . For example, consider  $S = \{01, 10\}$ and thus \*S = MM such that  $res(*S) = \{00, 01, 10, 11\} =$  $\mathbb{B}^2$ . Hence,  $S \subseteq \operatorname{res}(*S)$  but not  $\operatorname{res}(*S) \subseteq S$ . In contrast, for  $|\operatorname{res}(*S)| \leq 2$ , we can see that the reverse direction holds.

**Observation 3.7.** For any subset of strings  $S \subseteq \mathbb{B}^B$ , if  $|\operatorname{res}(*S)| \leq 2$ , then  $\operatorname{res}(*S) = S$ .

*Proof.* Since \*S can contain at most one M bit, we know that S can contain at most two strings that differ in one position. It is then straightforward to show that every string in res(\*S) is element of S. Together with Observation 3.6 this shows the equality.

The metastable closure of an operator on binary inputs extends it to inputs that may contain metastable bits. This is done by considering all resolutions of the inputs, applying the operator, and taking the superposition of the results.

Definition 3.8 (The M Closure [10]). Given an operator  $f: \mathbb{B}^n \to \mathbb{B}^m$ , its metastable closure  $f_{\mathbb{M}}: \mathbb{B}^n_{\mathbb{M}} \to \mathbb{B}^m_{\mathbb{M}}$  is defined by  $f_{\mathbb{M}}(x) := *\{f(x')|x' \in \operatorname{res}(x)\}$ . Recalling the basic notation we abbreviate this by  $f_{\mathbb{M}}(x) = *f(\operatorname{res}(x))$ .

The closure is the best one can achieve w.r.t. containing metastability with clocked logic using standard registers [10], i.e., when  $f_{\mathbb{M}}(x)_i = \mathbb{M}$ , no such implementation can guarantee that the  $i^{th}$  output bit stabilizes in a timely fashion.

# 3.4 Output Specification

We want to construct a circuit computing the maximum and minimum of two valid strings, enabling us to build sorting networks for valid strings. First, however, we need to answer the question what it means to ask for the maximum or minimum of valid strings. To this end, suppose a valid string is  $\operatorname{rg}_{B}(x) * \operatorname{rg}_{B}(x+1)$  for some  $x \in [N-1]$ , i.e., the string contains a metastable bit that makes it uncertain whether the represented value is x or x + 1. If we wait for metastability to resolve, the string will stabilize to either  $\operatorname{rg}_B(x)$  or  $\operatorname{rg}_B(x+1)$ . Accordingly, it makes sense to consider  $rg_B(x) * rg_B(x + 1)$  "in between"  $rg_B(x)$  and  $rg_B(x + 1)$ , resulting in the following total order on valid strings (cf. Ta-

**Definition 3.9** ( $\prec$ ). We define a total order  $\prec$  on valid strings as follows. For  $g,h\in\mathbb{B}^B$ ,  $g\prec h\Leftrightarrow \langle g\rangle <\langle h\rangle$ . For each  $x\in$ [N-1], we define  $\operatorname{rg}_B(x) \prec \operatorname{rg}_B(x) * \operatorname{rg}_B(x+1) \prec \operatorname{rg}_B(x+1)$ . We extend the resulting relation on  $\mathcal{S}^B_{\operatorname{rg}} \times \mathcal{S}^B_{\operatorname{rg}}$  to a total order by taking the transitive closure. Note that this also defines ≤, via  $g \leq h \Leftrightarrow (g = h \vee g \prec h).$ 

We intend to sort with respect to this order. It turns out that implementing a 2-sort circuit w.r.t. this order amounts to implementing the metastable closure of max<sup>rg</sup> and min<sup>rg</sup>.

**Lemma 3.10.** Let  $g, h \in \mathcal{S}_{rg}^B$ . Then

$$g \leq h \Leftrightarrow (\max_{M}^{rg} \{g, h\}, \min_{M}^{rg} \{g, h\}) = (h, g)$$
.

*Proof.* If  $g \prec h$ , Definitions 3.3 and 3.9 imply for all  $g' \in \operatorname{res}(g)$  and all  $h' \in \operatorname{res}(h)$  that  $g' \leq h'$  (cf. Table 2). Observation 3.5 shows that  $* \operatorname{res}(g) = g$  for any  $g \in \mathcal{S}_{\operatorname{rg}}^B$ . From Definition 3.8, we can thus conclude that  $\max_{M}^{rg} \{g, h\} =$  $*\operatorname{res}(h) = h \text{ and } \min_{\mathbb{M}}^{\operatorname{rg}} \{g, h\} = *\operatorname{res}(g) = g.$ 

If  $h \prec g$ , analogous reasoning shows that

$$(\max_{M}^{rg} \{g, h\}, \min_{M}^{rg} \{g, h\}) = (g, h) \neq (h, g).$$

The remaining case is that g = h. In the case where g does not contain an M bit, we have  $\max_{\mathbb{M}}^{\operatorname{rg}}\{g,h\} = \max^{\operatorname{rg}}\{g,h\} =$ g=h. Considering the second case  $(g=h=\operatorname{rg}_B(x)*\operatorname{rg}_B(x+1),$  for  $x\in[2^B-1])$ , we get that  $\max_{\mathbb{M}}^{\operatorname{rg}}\{g,h\}=$  $*\{rg_B(x), rg_B(x+1)\} = rg_B(x)*rg_B(x+1) = h$ . Likewise,  $\min_{\mathbb{M}}^{\operatorname{rg}}\{g,h\} = h = g.$ 

In other words,  $\max_{\tt M}^{\tt rg}$  and  $\min_{\tt M}^{\tt rg}$  are the  $\max$  and  $\min$ operators w.r.t. the total order on valid strings shown in Table 2, e.g.,

- $\max_{M}^{rg} \{1001, 1000\} = rg_4(15) = 1000$ ,
- $\bullet \ \max_{\mathtt{M}}^{rg} \{\mathtt{0M10}, \mathtt{0010}\} = rg_4(3) * rg_4(4) = \mathtt{0M10}$  , and
- $\max_{M}^{rg} \{0M10, 0110\} = rg_4(4) = 0110.$

Hence, our task is to implement  $\max_{M}^{rg}$  and  $\min_{M}^{rg}$ .

**Definition 3.11** (2-sort(B)). For  $B \in \mathbb{N}$ , a 2-sort(B) circuit is specified as follows.

Input:  $g,h \in \mathcal{S}_{\mathrm{rg}}^B$ ,

Output:  $g', h' \in \mathcal{S}_{rg}^B$ , Functionality:  $g' = \max_{\mathbb{M}}^{rg} \{g, h\}$ ,  $h' = \min_{\mathbb{M}}^{rg} \{g, h\}$ .

#### 3.5 Computational Model and CMOS Logic

We seek to use standard components and combinational logic only. We use the model of [10], which specifies the behavior of basic gates on metastable inputs via the metastable

TABLE 3: Extensions to metastable inputs of AND (left), OR (center), and an inverter (right) according to Kleene logic.

| a<br>b<br>0<br>1 | 0 | 1 | М | a<br>b | 0           | 1 | M |   | a   | а<br>1<br>0<br>м |
|------------------|---|---|---|--------|-------------|---|---|---|-----|------------------|
| $\frac{D}{0}$    | 0 | 0 | 0 | 0      | 0           | 1 | М | - | 0   | 1                |
| 1                | 0 | 1 | М | 1      | 0<br>1<br>M | 1 | 1 |   | 1   | 0                |
| М                | 0 | M | Μ | M      | М           | 1 | М |   | ΙΔΙ | I <sub>I</sub>   |

closure of their behavior on binary inputs, cf. Table 3. We use the standard notational convention that  $a+b=\mathrm{OR}_{\mathbb{M}}(a,b)$  and  $ab=\mathrm{AND}_{\mathbb{M}}(a,b)$ .

Note that in this logic, most familiar identities hold: AND and OR are associative, commutative, and distributive, and DeMorgan's laws hold. However, naturally the law of the excluded middle becomes void. For instance, in general,  $OR(x, \bar{x}) \neq 1$ , as OR(M, M) = M.

We now argue that basic CMOS gates behave according to this logic, justifying the model. For the sake of an intuitive notation, we apply some slightly unusual conventions. In the following, let  $R_1$  be a wildcard that can refer to any resistance that is "low", i.e., close to being negligible, as e.g. that of a transistor in its stable conducting state (i.e., any PMOS transistor subjected to a low gate voltage or any NMOS transistor subjected to a high gate voltage). Similar, denote by  $R_0$  any resistance that is "high", i.e., large compared to  $R_1$ , such as the resistance of a transistor in its stable non-conducting state. Thus, with a stable input  $b \in \mathbb{B}$  (where we identify 0 with low and 1 with high voltage), an NMOS transistor attains resistance  $R_b$ , while a PMOS transistor attains resistance  $R_{\bar{b}}$ . We can extend this to unstable inputs M by making the conservative assumption that  $R_{\mathbb{M}}$  is an arbitrary (possibly time-dependent) resistance.

With this notation, we can see that parallel and serial composition of transistors implements AND and OR in Kleene logic, respectively.

**Lemma 3.12.** For  $k \in \mathbb{N}$  sufficiently small so that  $kR_1 \ll R_0$ , let  $a_1, \ldots, a_k \in \mathbb{B}_{\mathbb{M}}$  be input signals fed to k NMOS transistors interconnected (i) in parallel or (ii) sequentially. Set  $\sigma := \sum_{i=1}^k a_i$  and  $\pi := \prod_{i=1}^k a_i$ , i.e., the OR resp. AND over all inputs. Then the resistance between input and output of the resulting subcircuit is (roughly) (i)  $R_{\sigma}$  resp. (ii)  $R_{\pi}$ .

*Proof.* Denote by R the resistance between the input and output of the subcircuit. Suppose first that  $\sigma=0$ , i.e.,  $a_i=0$  for all i. Then, for parallel composition, we get that  $1/R=\sum_{i=1}^k 1/R_{a_i}=k/R_0$ , yielding that  $R\geq R_0/k$ . On the other hand, if  $\sigma=1$ , there is an index i such that  $a_i=1$ , yielding for parallel composition that  $1/R=\sum_{i=1}^k 1/R_{a_i}\geq 1/R_1$  and thus  $R\leq R_1$ . This shows (i).

Now consider sequential composition and suppose first that  $\pi=1$ , i.e.,  $a_i=1$  for all i. Thus,  $R=\sum_{i=1}^k R_{a_i} \leq kR_1$ . In case  $\pi=0$ , there is some index i so that  $a_i=0$ , implying that  $R=\sum_{i=1}^k R_{a_i} \geq R_0$ .

The same arguments apply to PMOS transistors.

**Corollary 3.13.** For  $k \in \mathbb{N}$  sufficiently small so that  $kR_1 \ll R_0$ , let  $a_1, \ldots, a_k \in \mathbb{B}_{\mathbb{M}}$  be input signals fed to k PMOS transistors interconnected (i) in parallel or (ii) sequentially. Set  $\sigma := \sum_{i=1}^k \bar{a}_i$  and  $\pi := \prod_{i=1}^k \bar{a}_i$ , i.e., the OR resp. AND over



Fig. 1: Standard transistor-level implementations of inverter (left), NAND (center), and NOR (right) gates in CMOS technology. The latter can be turned into AND and OR, respectively, by appending an inverter.

all inputs. Then the resistance between input and output of the resulting subcircuit is (roughly) (i)  $R_{\sigma}$  resp. (ii)  $R_{\pi}$ .

We remark that the factor of k reduction in the gap between  $R_1$  and  $R_0$  may imply that a gate's output signal needs to be regenerated using a buffer. However, this is the same behavior as for logic that assumes stable signals only, so standard CMOS design techniques account for this.

From the above observations, we can readily infer that standard CMOS gate implementations behave according to Kleene logic in face of potentially metastable signals, justifying the model from [10].

**Theorem 3.14.** *The CMOS gates depicted in Figure 1 implement the truth tables given in Table 3.* 

*Proof.* The output of the gates is 1 (high voltage) if the resistances from  $V_{DD}$  and  $V_{SS}$  to the output are low (i.e., roughly  $R_1$ ) and high  $(R_0)$ , respectively. Similarly, it is 0 if the roles are reversed. Thus, Lemma 3.12 and Corollary 3.13 show the claim for stable entries of the truth tables. For the unstable ones, setting  $R_{\mathbb{M}}$  (which is a wildcard for an arbitrary resistance) to  $R_0$  or  $R_1$ , respectively, leads to different outcomes. Thus, the output voltage may attain almost any value between  $V_{DD}$  and  $V_{SS}$ , i.e., the output is  $\mathbb{M}$ .

Similar reasoning applies to many gates, e.g., NAND and NOR gates. We stress, however, that the property of implementing the closure of the function computed by the gate on stable values is not universal for CMOS logic. For instance, standard transistor-level multiplexer implementations do not handle metastability well, cf. [11].

#### 4 DECOMPOSITION OF THE TASK

In this section, we show that computing  $\max_{\mathbb{M}}^{\operatorname{rg}}\{g,h\}$  and  $\min_{\mathbb{M}}^{\operatorname{rg}}\{g,h\}$  for valid strings  $g,h\in\mathcal{S}^B_{\operatorname{rg}}$  can be broken down into composing simple operators in  $\mathbb{B}^2_{\mathbb{M}}\times\mathbb{B}^2_{\mathbb{M}}\to\mathbb{B}^2_{\mathbb{M}}$ .

#### 4.1 Comparing Stable Gray Codes via an FSM

Figure 2 depicts a finite state machine performing a four-valued comparison of two Gray code strings. In each step of processing inputs  $g, h \in \mathbb{B}^B$ , it is fed the pair of  $i^{th}$  input bits  $g_i h_i$ . In the following, we denote by  $s^{(i)}(g,h)$  the state of the machine after i steps, where  $s^{(0)}(g,h) \coloneqq 00$  is the starting state. For ease of notation, we will omit the arguments g



Fig. 2: Finite state machine determining which of two Gray code inputs  $g, h \in \mathbb{B}^B$  is larger. In each step, it receives  $g_i h_i$ as input. State encoding is given in square brackets.

TABLE 4: Run of the FSM on inputs g = 1001 and h = 1000

| i                                                 | 0  | 1  | 2  | 3  | 4  |
|---------------------------------------------------|----|----|----|----|----|
| $g_i h_i$                                         |    | 11 | 00 | 00 | 10 |
| $s^{(i)} = s^{(i-1)} \diamond g_i h_i$            | 00 | 11 | 11 | 11 | 01 |
| $g_i' = \operatorname{out}(s^{(i-1)}, g_i h_i)_1$ |    | 1  | 0  | 0  | 0  |
| $h_i' = \operatorname{out}(s^{(i-1)}, g_i h_i)_2$ |    | 1  | 0  | 0  | 1  |

and h of  $s^{(i)}$  whenever they are clear from context. Table 4 shows an example of a run of the finite state machine.

Because the parity keeps track of whether the remaining bits are to be compared w.r.t. the standard or "reflected" order, the state machine performs the comparison correctly w.r.t. the meaning of the states indicated in Figure 2.

**Lemma 4.1.** Let  $g, h \in \mathbb{B}^B$  and  $i \in [B+1]$ . Then

- $s^{(i)} = 00$  is equivalent to  $g_{1,i} = h_{1,i}$  and  $g \prec h$  if and only
- if  $g_{i+1,B} \prec h_{i+1,B}$ ,  $s^{(i)}=11$  is equivalent to  $g_{1,i}=h_{1,i}$  and  $g \prec h$  if and only  $if g_{i+1,B} \succ h_{i+1,B},$
- $s^{(i)} = 01$  is equivalent to  $g \prec h$ , and
- $s^{(i)} = 10$  is equivalent to g > h.

Proof. We show the claim by induction on i. It holds for i = 0, as  $s^{(i)} = 00$ ,  $g_{1,0} = h_{1,0}$  is the empty string, and  $g \prec h$  if and only if  $g_{1,B} = g \prec h = h_{1,B}$ . For the step from  $i-1 \in [B]$  to i, we make a case distinction based on  $s^{(i-1)}$ .  $s^{(i-1)} = 00$ : By the induction hypothesis,  $g_{1,i-1} = h_{1,i-1}$ and  $g \prec h$  if and only if  $g_{i,B} \prec h_{i,B}$ . Thus, if  $g_i h_i = 00$ ,  $s^{(i)} = 00$ ,  $g_{1,i} = h_{1,i}$ , and by the recursive definition of the code,  $g_{i,B} \prec h_{i,B} \Leftrightarrow g_{i+1,B} \prec h_{i+1,B}$ . Similarly, if  $g_i h_i = 11$ , also  $g_{1,i} = h_{1,i}$ , but the code for the remaining bits is "reflected," i.e.,  $g \prec h \Leftrightarrow g_{i+1,B} \succ h_{i+1,B}$ . If  $g_i h_i = 01$ , the definition implies that  $g \prec h$  regardless of further bits, and if  $g_i h_i = 10$ , g > h regardless of further bits.

 $s^{(i-1)} = 11$ : Analogously to the previous case, noting that reflecting a second time results in the original order.

 $s^{(i-1)} = 01$ : By the induction hypothesis,  $g \prec h$ . As 01 is an absorbing state, also  $s^{(i)} = 01$ .

 $s^{(i-1)} = 10$ : By the induction hypothesis, g > h. As 10 is an absorbing state, also  $s^{(i)} = 10$ .

This lemma gives rise to a sequential implementation of 2-sort(B) based on the given state machine, for input strings

TABLE 5: Computing  $\max^{rg} \{g, h\}_i$  and  $\min^{rg} \{g, h\}_i$  from the current state  $s^{(i-1)}$  and inputs  $g_i$  and  $h_i$ .

| $s^{(i-1)}$ | $\max^{\operatorname{rg}}\{g,h\}_i$ | $\min^{\mathrm{rg}}\{g,h\}_i$ |
|-------------|-------------------------------------|-------------------------------|
| 00          | $\max\{g_i,h_i\}$                   | $\min\{g_i, h_i\}$            |
| 10          | $g_i$                               | $h_i$                         |
| 11          | $\min\{g_i,h_i\}$                   | $\max\{g_i, h_i\}$            |
| 01          | $h_i$                               | $g_i$                         |

TABLE 6: Operators for next state and output. The first operand is the current state, the second is the next input.

(a) The  $\diamond$  operator.

| ♦  | 00   | 01 | 11 | 10 |
|----|------|----|----|----|
| 00 | 1 00 | 01 | 11 | 10 |
| 00 | 00   | 01 | 11 | 10 |
| 01 | 01   | 01 | 01 | 01 |
| 01 | 01   | 01 | 01 |    |
| 11 | 11   | 10 | 00 | 01 |
| 10 | 10   | 10 | 10 |    |
| 10 | 10   | 10 | 10 | 10 |

(b) The out operator.

| out | 00 | 01 | 11 | 10 |
|-----|----|----|----|----|
| 00  | 00 | 10 | 11 | 10 |
| 01  | 00 | 10 | 11 | 01 |
| 11  | 00 | 01 | 11 | 01 |
| 10  | 00 | 01 | 11 | 10 |

in  $\mathbb{B}^B$ . Table 5 lists the  $i^{th}$  output bit as function of  $s^{(i-1)}$ and the pair  $g_i h_i$ . Correctness of this computation follows immediately from Lemma 4.1.

We can express the transition function of the state machine as an (as easily verified) associative operator  $\diamond$  taking the current state and input  $g_ih_i$  as argument and returning the new state. Then  $s^{(i)}=s^{(i-1)}\diamond g_ih_i$ , where  $\diamond$  is given in Table 6a and  $s^{(0)} = 00$ . The out operator is derived from Table 5 by evaluating  $\max^{rg} \{g, h\}_i$  and  $\min^{rg} \{g, h\}_i$  for all possible values of  $g_i h_i \in \mathbb{B}^2$ . Noting that  $s^{(0)} \diamond x = 00 \diamond x = x$ for all  $x \in \mathbb{B}^2$ , we arrive at the following corollary.

**Corollary 4.2.** For all  $i \in [1, B]$ , we have that

$$\max^{\operatorname{rg}} \{g, h\}_i \min^{\operatorname{rg}} \{g, h\}_i = \operatorname{out} \left( \bigotimes_{j=1}^{i-1} g_j h_j, g_i h_i \right).$$

Our goal in this section is to extend this approach to potentially metastable inputs.

#### 4.2 Dealing with Metastable Inputs

Our strategy is to replace all involved operators by their metastable closure: for  $i \in [1, B]$  (i) compute  $s_{\mathbb{M}}^{(i)}$ , (ii) determine  $\max_{M}^{rg} \{g, h\}_i$  and  $\min_{M}^{rg} \{g, h\}_i$  according to Table 5, and finally (iii) exploit associativity of the operator computing the state  $s_{\text{M}}^{(i)}$  for usage in the PPC framework ([23], see Section 5). Thus, we only need to implement  $\diamond_{M}$  and the  $\operatorname{out}_{M}$ (both of constant size), plug them into the framework, and immediately obtain an efficient circuit.

The reader may ask why we compute  $s_{\rm M}^{(i)}$  for all  $i\in[0,B-1]$  instead of computing only  $s_{\rm M}^{(B)}$  with a simple tree of ⋄<sub>M</sub> elements, which would yield a smaller circuit. Since  $s_{\rm M}^{(B)}$  is the result of the comparison of the entire strings, it could be used to compute all outputs, i.e., we could compute the output by  $\operatorname{out}_{\mathbb{M}}(s_{\mathbb{M}}^{(B)}, g_i h_i)$  instead of  $\operatorname{out}_{\mathbb{M}}(s_{\mathbb{M}}^{(i-1)}, g_i h_i)$ . However, in case of metastability, this may lead to incorrect results. This can be seen in the example run of the FSM given in Table 7. We thus compute every intermediate state  $s_{\scriptscriptstyle M}^{(i)}$ .

Unfortunately, even with this modification it is not obvious that our approach yields correct outputs. There are three hurdles to overcome:

(P1) Show that  $\diamond_{M}$  is associative.

TABLE 7: Run of the FSM on inputs g=0M10 and h=0010, showing that computing only the last state is insufficient. This yields  $\operatorname{out_M}(1\text{M}, \text{M0}) = *\{00, 01, 10\} = \text{MM}$  as second output, but  $\operatorname{out_M}(00, \text{M0}) = *\{00, 10\} = \text{M0}$  is correct.

| i                                                                             | 0  | 1  | 2  | 3  | 4  |
|-------------------------------------------------------------------------------|----|----|----|----|----|
| $g_i h_i$                                                                     |    | 00 | MO | 11 | 00 |
| $s_{\mathtt{M}}^{(i)} = s_{\mathtt{M}}^{(i-1)} \diamond_{\mathtt{M}} g_i h_i$ | 00 | 00 | м0 | 1M | 1M |
| $\operatorname{out}_{\mathtt{M}}(s_{\mathtt{M}}^{(4)},g_{i}h_{i})$            |    | 00 | MM | 11 | 00 |
| $\operatorname{out}_{\mathtt{M}}(s_{\mathtt{M}}^{(i-1)},g_{i}h_{i})$          |    | 00 | М0 | 11 | 00 |

TABLE 8: The  $\diamond_{M}$  operator. The first operand is the current state, the second are the next input bits.

| $\diamond_{\mathbb{M}}$ | 00 | 0M | 01 | м1 | 11 | 1M | 10 | м0 | MM |
|-------------------------|----|----|----|----|----|----|----|----|----|
| 00                      | 00 | 0M | 01 | м1 | 11 | 1M | 10 | М0 | MM |
| 0M                      | 0м | M0 | 01 | м1 | м1 | MM | MM | MM | MM |
| 01                      | 01 | 01 | 01 | 01 | 01 | 01 | 01 | 01 | 01 |
| м1                      | м1 | MM | MM | MM | 0M | 0M | 01 | м1 | MM |
| 11                      | 11 | 1M | 10 | M0 | 00 | 0M | 01 | м1 | MM |
| 1M                      | 1M | 1M | 10 | M0 | M0 | MM | MM | MM | MM |
| 10                      | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| M0                      | M0 | MM | MM | MM | 1M | 1M | 10 | м0 | MM |
| MM                      | MM | MM | MM | MM | MM | MM | MM | MM | MM |

- (P2) Show that repeated application of  $\diamond_{\mathbb{M}}$  computes  $s_{\mathbb{M}}^{(i)}$ .
- (P3) Show that applying  $\operatorname{out}_{\mathbb{M}}$  to  $s_{\mathbb{M}}^{(i-1)}$  and  $g_ih_i$  results for all valid strings in  $\max_{\mathbb{M}}^{\operatorname{rg}} \{g,h\}_i \min_{\mathbb{M}}^{\operatorname{rg}} \{g,h\}_i$ .

Regarding the first point, we note the statement that  $\diamond_{\mathbb{M}}$  is associative does not depend on B. In other words, it can be verified by checking for all possible  $x,y,z\in\mathbb{B}^2_{\mathbb{M}}$  whether  $(x\diamond_{\mathbb{M}}y)\diamond_{\mathbb{M}}z=x\diamond_{\mathbb{M}}(y\diamond_{\mathbb{M}}z).$  While it is tractable to manually verify all  $3^6=729$  cases (exploiting various symmetries and other properties of the operator), it is tedious and prone to errors. Instead, we verified that both evaluation orders result in the same outcome by a short computer program.

**Theorem 4.3.** (P1) holds, i.e.,  $\diamond_{\mathbb{M}}$  is associative.

Apart from being essential for our construction, this theorem simplifies notation; in the following, we may write

$$\left( \diamondsuit_{\mathtt{M}} \right)_{i=1}^{\jmath} g_i h_i \coloneqq g_1 h_1 \diamondsuit_{\mathtt{M}} g_2 h_2 \diamondsuit_{\mathtt{M}} \ldots \diamondsuit_{\mathtt{M}} g_j h_j ,$$

where the order of evaluation does not affect the result.

We stress that in general the closure of an associative operator needs not be associative. A counter-example is given by binary addition modulo 4:

$$(0M +_{M} 01) +_{M} 01 = MM \neq 1M = 0M +_{M} (01 +_{M} 01).$$

# 4.3 Determining $s_{\scriptscriptstyle \mathrm{M}}^{(i)}$

For convenience of the reader, Table 8 gives the truth table of  $\diamond_{\mathbb{M}} \colon \mathbb{B}^2 \times \mathbb{B}^2 \to \mathbb{B}^2$ . We need to show that repeated application of this operator to the input pairs  $g_j h_j$ ,  $j \in [1,i]$ , actually results in  $s_{\mathbb{M}}^{(i)}$ . This is closely related to the key observation that if in a valid string there is a metastable bit at position m, then the remaining B-m following bits are the maximum codeword of a (B-m)-bit code.

**Observation 4.4.** For  $g \in \mathcal{S}^B_{\mathrm{rg}}$ , if there is an index  $1 \leq m < B$  such that  $g_m = M$  then  $g_{m+1,B} = 10^{B-m-1}$ .

*Proof.* List the codewords in order. By the recursive definition of the code, removing the first m-1 bits of the

code leaves us with  $2^{m-1}$  repetitions of (B-m+1)-bit code alternating between listing it in order and in reverse ("reflected") order. Also by the recursive definition, the  $m^{th}$  bit toggles only when the (B-m)-bit code resulting from removing it is at its last codeword,  $10^{B-m-1}$ .

Our reasoning will be based on distinguishing two main cases: one is that  $s_{\rm M}^{(i)}$  contains at most one metastable bit, the other that  $s_{\rm M}^{(i)}={\rm MM}.$  For each we need a technical statement.

**Observation 4.5.** If  $\left| \operatorname{res} \left( s_{\mathbb{M}}^{(i)} \right) \right| \leq 2$  for any  $i \in [B+1]$ , then  $\operatorname{res}(s_{\mathbb{M}}^{(i)}) = \bigotimes_{i=1}^{i} \operatorname{res}(g_{i}h_{i})$ .

*Proof.* With  $S := \bigotimes_{j=1}^{i} \operatorname{res}(g_{j}h_{j})$ , we have that  $\operatorname{res}\left(s_{\mathbb{M}}^{(i)}\right) = \operatorname{res}(*S)$ . The claim thus follows from Observation 3.7.  $\square$ 

**Lemma 4.6.** Suppose that for valid strings  $g,h \in \mathcal{S}^B_{\mathrm{rg}}$ , it holds that  $s^{(i)}_{\mathbb{M}} = \mathrm{MM}$  for some  $i \in [1,B]$ . Then g=h and  $s^{(j)}_{\mathbb{M}} = \mathrm{MM}$  for all  $j \in [i,B]$ .

Equipped with these tools, we are ready to prove the second statement.

**Theorem 4.7.** (P2) holds, i.e., for all  $g, h \in \mathcal{S}_{rg}^B$  and  $i \in [1, B]$ ,  $s_{M}^{(i)} = \left( \diamondsuit_{M} \right)_{i=1}^{i} g_j h_j$ .

*Proof.* We show the claim by induction on i. Trivially, we have that  $s_{\rm M}^{(0)}=s^{(0)}=00$  and thus for i=1 that

$$s_{\mathbb{M}}^{(1)} = s_{\mathbb{M}}^{(0)} \diamondsuit_{\mathbb{M}} g_1 h_1 = 00 \diamondsuit_{\mathbb{M}} g_1 h_1 = g_1 h_1 = \left( \diamondsuit_{\mathbb{M}} \right)_{i=1}^1 g_1 h_1 \,.$$

Hence, suppose that the claim has been established for  $i-1 \in [1,B-1]$  and consider index i. If  $\left|\operatorname{res}\left(s_{\scriptscriptstyle \mathbb{M}}^{(i-1)}\right)\right| \leq 2$ , Observation 4.5 and the induction hypothesis yield that

$$\begin{split} \left( \diamondsuit_{\mathbf{M}} \right)_{j=1}^{i} g_{j} h_{j} &= \left( \left( \diamondsuit_{\mathbf{M}} \right)_{j=1}^{i-1} g_{j} h_{j} \right) \diamondsuit_{\mathbf{M}} g_{i} h_{i} \\ &= s_{\mathbf{M}}^{(i-1)} \diamondsuit_{\mathbf{M}} g_{i} h_{i} = * \left( \operatorname{res} \left( s_{\mathbf{M}}^{(i-1)} \right) \diamondsuit \operatorname{res} (g_{i} h_{i}) \right) \\ &= * \diamondsuit_{j=1}^{i} \operatorname{res} (g_{i} h_{i}) = s_{\mathbf{M}}^{(i)} \,. \end{split}$$

It remains to consider the case that  $s_{\mathbb{M}}^{(i-1)} = \mathbb{MM}$ . By Lemma 4.6,  $s_{\mathbb{M}}^{(i)} = \mathbb{MM}$ , too. Thus,

$$\left( \diamondsuit_{\mathtt{M}} \right)_{j=1}^{i} g_{j} h_{j} = s_{\mathtt{M}}^{(i-1)} \diamondsuit_{\mathtt{M}} g_{i} h_{i} = \mathtt{MM} \diamondsuit_{\mathtt{M}} g_{i} h_{i} = \mathtt{MM} = s_{\mathtt{M}}^{(i)}. \ \Box$$

TABLE 9: The  $out_M$  operator. The first operand is the current state, the second is the next input bits.

| $\operatorname{out}_{\mathtt{M}}$ | 00 | 0M | 01 | м1 | 11 | 1M | 10 | м0 | MM |
|-----------------------------------|----|----|----|----|----|----|----|----|----|
| 00                                | 00 | M0 | 10 | 1M | 11 | 1M | 10 | M0 | MM |
| 0M                                | 00 | M0 | 10 | 1M | 11 | MM | MM | MM | MM |
| 01                                | 00 | M0 | 10 | 1M | 11 | м1 | 01 | 0M | MM |
| м1                                | 00 | MM | MM | MM | 11 | м1 | 01 | 0M | MM |
| 11                                | 00 | 0M | 01 | м1 | 11 | м1 | 01 | 0M | MM |
| 1M                                | 00 | 0M | 01 | м1 | 11 | MM | MM | MM | MM |
| 10                                | 00 | 0M | 01 | м1 | 11 | 1M | 10 | M0 | MM |
| M0                                | 00 | MM | MM | MM | 11 | 1M | 10 | 0M | MM |
| MM                                | 00 | MM | MM | MM | 11 | MM | MM | MM | MM |

# 4.4 Obtaining the Outputs from $s_{\scriptscriptstyle \mathrm{M}}^{(i)}$

Recall that  $\operatorname{out} \colon \mathbb{B}^2 \times \mathbb{B}^2 \to \mathbb{B}^2$  is the operator given in Table 5 computing  $\max^{\operatorname{rg}} \{g,h\}_i \operatorname{min}^{\operatorname{rg}} \{g,h\}_i$  out of  $s^{(i-1)}$  and  $g_ih_i$ . For convenience of the reader, we provide the truth table of  $\operatorname{out}_{\mathbb{M}} \colon \mathbb{B}^2 \times \mathbb{B}^2 \to \mathbb{B}^2$  in Table 9. We prove the third property.

**Theorem 4.8.** (P3) holds, i.e., given valid inputs  $g, h \in \mathcal{S}^B_{rg}$  and  $i \in [1, B]$ ,  $\operatorname{out}_{\mathbb{M}}(s^{(i-1)}_{\mathbb{M}}, g_i h_i) = \max_{\mathbb{M}}^{rg} \{g, h\}_i \min_{\mathbb{M}}^{rg} \{g, h\}_i$ .

*Proof.* Assume first that 
$$\left|\operatorname{res}\left(s_{\scriptscriptstyle{\mathbb{M}}}^{(i-1)}\right)\right|\leq 2.$$
 Then

$$\operatorname{out}_{\mathbb{M}}(s_{\mathbb{M}}^{(i-1)}(g,h), g_{i}h_{i})$$

$$= * \operatorname{out}\left(\operatorname{res}\left(s_{\mathbb{M}}^{(i-1)}(g,h)\right), \operatorname{res}(g_{i}h_{i})\right)$$

$$\overset{\operatorname{Obs.} 4.5}{=} * \operatorname{out}\left(\bigotimes_{j=1}^{i-1}\operatorname{res}(g_{j}h_{j}), \operatorname{res}(g_{i}h_{i})\right)$$

$$\overset{\operatorname{Cor.} 4.2}{=} * (\max^{\operatorname{rg}}\{\operatorname{res}(g), \operatorname{res}(h)\}_{i} \operatorname{min}^{\operatorname{rg}}\{\operatorname{res}(g), \operatorname{res}(h)\}_{i})$$

$$= \max_{\mathbb{M}}^{\operatorname{rg}}\{q, h\}_{i} \operatorname{min}_{\mathbb{M}}^{\operatorname{rg}}\{q, h\}_{i}.$$

Otherwise,  $s_{\mathbb{M}}^{(i-1)} = \mathbb{MM}$ . Then, by Lemma 4.6, g = h. In particular,  $g_i = h_i$ . Checking Table 9, we see that for all  $b \in \mathbb{B}_{\mathbb{M}}$ , it holds that  $\mathrm{out}_{\mathbb{M}}(\mathbb{MM},bb) = bb$ . Therefore,

$$\operatorname{out}_{\mathbb{M}}(s_{\mathbb{M}}^{(i-1)}(g,h),g_{i}h_{i}) = g_{i}h_{i} = \max_{\mathbb{M}}^{\operatorname{rg}}\{g,h\}_{i} \operatorname{min}^{\operatorname{rg}}\{g,h\}_{i}$$
 in this case as well.

## 5 THE PPC FRAMEWORK

In order to derive a small circuit from the results of Section 4, a straightforward approach would be to unroll the FSM. We could design a circuit implementing the transition function  $\diamond_{\mathbb{M}}$  and apply it B times to the starting state  $s^{(0)}$  and each input  $g_ih_i$ . However, computing the sequence of states step by step yields a (non-optimal) linear depth of at least B.

Hence, we make use of the PPC framework by Ladner and Fischer [23]. They describe a generic method that is applicable to *any* finite state machine translating a sequence of B input symbols to B output symbols, to obtain circuits of size  $\mathcal{O}(B)$  and depth  $\mathcal{O}(\log B)$ . They observe that each input symbol defines a restricted transition function. Compositions of these functions evaluated on the starting state yield the state of the machine after receiving corresponding inputs. The major advantage of the technique is that compositions of restricted transition functions can be computed in parallel due to associativity, yielding a depth of  $\mathcal{O}(\log B)$ . This matches our needs, as we need to determine  $s_{\mathbb{M}}^{(i)}$  for each  $i \in [B]$ . However, their generic construction



Fig. 3: An example for a computation of the 2-sort(9) circuit arising from our construction for fan-out f=3. The inputs are g=101010110 and h=101 M 10000; see Table 10 for  $s_{\text{M}}^{(i)}(g,h)$  and the output. We labeled each  $\diamond_{\text{M}}$  by its output. Buffers and duplicated gates (here the one computing 0M) reduce fan-out, but do not affect the computation. Grey boxes indicate recursive steps of the PPC construction; see also Figure 7 for a larger PPC circuit using the one here in its "right" top-level recursion. For better readability, wires not taking part in a recursive step are dashed or dotted.

TABLE 10: Example run of the FSM in Figure 2 on inputs g=101010110 and  $h=101 \mathrm{M} 10000$ . We drop  $s_{\mathrm{M}}^{(9)}$ , as it is not needed to compute  $g_9'h_9'$ .

| i                     | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  |
|-----------------------|----|----|----|----|----|----|----|----|----|----|
| $g_i h_i$             |    | 11 | 00 | 11 | 0M | 11 | 00 | 10 | 10 | 00 |
| $s_{	exttt{M}}^{(i)}$ | 00 | 11 | 11 | 00 | 0м | м1 | м1 | 01 | 01 |    |
| $g_i'h_i'$            |    | 11 | 00 | 11 | M0 | 11 | 00 | 01 | 01 | 00 |

involves large constants. Fortunately, we have established that  $\diamond_{\mathbb{M}} \colon \mathbb{B}^2_{\mathbb{M}} \times \mathbb{B}^2_{\mathbb{M}}$  is an associative operator, permitting us to directly apply the circuit templates for associative operators they provide for computing  $s^{(i)}_{\mathbb{M}} = \left( \diamondsuit_{\mathbb{M}} \right)^i_{j=1} g_j h_j$  for all  $i \in [B]$ . Accordingly, we discuss these templates only. During discussion of the basic construction we show a minor improvement on their results.

Before proceeding, the reader may want to take a look at the example given in Figure 3, which shows how a 2-sort(9) derived from our construction processes an input pair.

### 5.1 The Basic Construction

We revisit the templates for parallel computation of all prefixes, i.e., the part of the framework relevant to our construction. To this end, recall Definition 1.1. In our case,  $\oplus = \diamond_{\mathbb{M}}$  and  $D = \mathbb{B}^2_{\mathbb{M}}$ . [23] provides a family of recursive constructions of  $\operatorname{PPC}_{\oplus}$  circuits. They are obtained by combining two different recursive patterns. The first pattern, which optimizes for size of the resulting circuits, is depicted in Figure 4a. We distinguish between even and odd number of inputs. If B is even, we discard the rightmost gray wire

and set  $\bar{B} \coloneqq B$ ; if B is odd, we set  $\bar{B} \coloneqq B-1$  and include the rightmost wire. In the following, denote by |C| the size of a circuit C and by d(C) its depth.

**Lemma 5.1.** Suppose that C and P are circuits implementing  $\oplus$  and  $\operatorname{PPC}_{\oplus}(\lceil B/2 \rceil)$  for some  $B \in \mathbb{N}$ , respectively. Then applying the recursive pattern given at the left of Figure 4 yields a  $\operatorname{PPC}_{\oplus}(B)$  circuit. It has depth 2d(C) + d(P) and size at most (B-1)|C| + |P|. Moreover, the last output is at depth at most d(C) + d(P) of the circuit.

*Proof.* Observe that P receives as inputs  $d_{2i-1} \oplus d_{2i}$  for  $i \in [1, \lfloor B/2 \rfloor]$ , and in addition  $d_B$  in case B is odd. Thus, it outputs  $\pi_i' = \bigoplus_{j=1}^{2i} d_j$  for  $i \in [1, \lfloor B/2 \rfloor]$ , and also  $\pi'_{\lceil B/2 \rceil} = \bigoplus_{j=1}^{B} d_j$  if B is odd. Hence, the circuit outputs  $\pi_i = \bigoplus_{j=1}^{i} d_j$  if  $i \in [1, B]$  is even and

$$\pi_i = \pi_{i-1} \oplus d_i = \left(\bigoplus_{j=1}^{2\lfloor i/2\rfloor} d_j\right) \oplus d_i = \bigoplus_{j=1}^i d_j$$

if  $i \in [1,B]$  is odd, showing correctness. The depth of the circuit is immediate from the construction, and the size follows from the fact that there is exactly one instance of C for each even  $i \in [1,B]$  before P and one for each odd  $i \in [1,B] \setminus \{1,B\}$  after P. Output  $\pi_B$  has a depth that is smaller by d(C), as it is an output of P.

The second recursive pattern, shown in Figure 4c, avoids to increase the depth of the circuit beyond the necessary d(C) for each level of recursion. Assume for now that B is a power of 2. We represent the recursion as a tree  $T_b$ , where  $b \coloneqq \log B$ , given in the center of Figure 6. It has depth b with all leaves (filled in white) in this depth, and there are two types of non-leaf nodes: right nodes (filled in black) have two children, a left and a right node, whereas left nodes (filled in gray) have a single child, which is a right node.  $T_b$  is essentially a Fibonacci tree in disguise.

**Definition 5.2.**  $T_0$  is a single leaf.  $T_1$  consists of the (right) root and two attached leaves. For  $b \geq 2$ ,  $T_b$  can be constructed from  $T_{b-1}$  and  $T_{b-2}$  by taking a (right) root r, attaching the root of  $T_{b-1}$  as its right child, a new left node  $\ell$  as the left child of r, and then attaching the root of  $T_{b-2}$  as (only) child of  $\ell$ .

The recursive construction is now defined as follows. A right node applies the pattern given in Figure 4 to the right.  $R_\ell$  is the circuit (recursively) defined by the subtree rooted at the left child and  $R_r$  is the circuit (recursively) defined by the subtree rooted at the right child. Furthermore,  $\bar{B}=2^{b-d-1}$ , where  $d\in[b]$  is the depth of the node. A left child applies the pattern on the left.  $R_c$  is (recursively) defined by the subtree rooted at its child and  $\bar{B}=2^{b-d}$ , where  $d\in[b]$  is the depth of the node.

The base case for a single input and output is simply a wire connecting the input to the output, for both patterns. As  $b = \log B$  and each recursive step cuts the number of inputs and outputs in half, the base case applies if and only if the node is a leaf. Note that the figure shows the recursive patterns at the root and its left child, where  $\bar{B} = 2^{b-1}$  is always even (i.e., in this recursive pattern, the gray wire with index  $\bar{B} + 1$  is never present); when applying the patterns to nodes further down the tree,  $\bar{B}$  and B are scaled down by a factor of 2 for every step towards the leaves.

In the following, denote by  $PPC(C, T_b)$  the circuit that results from applying the recursive construction described above to the base circuit C implementing  $\oplus$ . Moreover, we refer to the  $i^{th}$  input and output of the subcircuit corresponding to node  $v \in T_b$  as  $d^v_i$  and  $\pi^v_i$ , respectively.

**Lemma 5.3.** If C implements  $\oplus$ ,  $PPC(C, T_b)$  is a  $PPC_{\oplus}(2^b)$  circuit.

*Proof.* We show the claim by induction on b. For b=0, the circuit correctly wires the input to the output, as we have only one leaf. For b=1, the first output equals the first input and the second output is the result of feeding both inputs into a copy of C.

For  $b\geq 2$ , by the induction hypothesis the circuit  $R_c$  used in the construction at the left child of the root is a  $\operatorname{PPC}_{\oplus}(2^{b-2})$  circuit. By Lemma 5.1, the circuit  $R_\ell$  in the construction at the root is thus a  $\operatorname{PPC}_{\oplus}(2^{b-1})$  circuit, showing that it outputs  $\pi_i^\ell = \bigoplus_{j=1}^i d_j = \pi_i$  for all  $i \in [1, 2^{b-1}]$ . From the induction hypothesis for b-1, we get that the circuit  $R_r$  used in the construction at the root is a  $\operatorname{PPC}_{\oplus}(2^{b-1})$  circuit, showing that it outputs  $\pi_i^r = \bigoplus_{j=2^{b-1}+1}^{2^{b-1}+i} d_j$  for all  $i \in [1, 2^{b-1}]$ . By construction of the right recursion pattern we conclude that for  $i \in [2^{b-1}+1, 2^b]$ , we get the outputs  $\pi_{2^{b-1}} \oplus \pi_{i-2^{b-1}}^r = \bigoplus_{j=1}^i d_i = \pi_i$ .

**Lemma 5.4.** PPC( $C, T_b$ ) has depth  $b \cdot d(C)$ .

*Proof.* We prove the claim by induction on b; it trivially holds for b=0, as we have only one leaf. For b=1,  $T_b$  is a right node with two leaves. The two leaves have depth 0; clearly, applying the right pattern from Figure 4 then results in depth d(C). For  $b\geq 2$ , the subcircuit  $R_r$  at the root has depth  $(b-1)\cdot d(C)$  by the induction hypothesis. For the subcircuit  $R_\ell$  at the root, consider its subcircuit  $R_c$ . By the induction hypothesis it has depth  $(b-2)\cdot d(C)$ . Hence, by Lemma 5.1,  $R_\ell$  has depth  $b\cdot d(C)$ , but its rightmost output  $\pi_{2^{b-1}}^{\ell}$  has depth only  $(b-1)\cdot d(C)$ . Thus, by construction the root's circuit has depth  $b\cdot d(C)$ .

It remains to bound the size of the circuit. Denote by  $F_i$ ,  $i \in \mathbb{N}$ , the  $i^{th}$  Fibonacci number, i.e.,  $F_1 = F_2 = 1$  and  $F_{i+1} = F_i + F_{i-1}$  for all  $2 \le i \in \mathbb{N}$ .

**Lemma 5.5.** PPC( $C, T_b$ ) has size  $(2^{b+2} - F_{b+5} + 1)|C|$ .

*Proof.* Denote by  $s_b$  the number of copies of |C| in the circuit  $\operatorname{PPC}(C,T_b)$ . We show by induction that  $s_b=2^{b+2}-F_{b+5}+1$  for all  $b\in\mathbb{N}_0$ . We have that  $s_0=0=2^2-F_5+1$  and that  $s_1=1=2^3-F_6+1$ . For  $b\geq 2$ , we have that  $s_b=s_{b-1}+s_{b-2}+s_r+s_\ell$ , where  $s_r$  and  $s_\ell$  denote the size contribution of the recursive steps at the root and its left child, respectively. Checking the recursive patterns in Figure 4, we see that  $s_r=B-\bar{B}=2^{b-1}$  and  $s_\ell=\bar{B}-1=2^{b-1}-1$ . Thus,  $s_b=s_{b-1}+s_{b-2}+2^b-1$ , which the induction hypothesis yields

$$s_b = 2^{b+1} - F_{b+4} + 1 + 2^b - F_{b+3} + 1 + 2^b - 1$$
  
=  $2^{b+2} - F_{b+5} + 1$ .

Asymptotically, the subtractive term of  $F_{b+5}$  is negligible, as  $F_{b+5} \in (1/\sqrt{5}+o(1))((1+\sqrt{5})/2)^{b+5} \subseteq \mathcal{O}(1.62^b)$ ; however, unless B is large, the difference is substantial. We also get a simple upper bound for arbitrary values of B. To this end, we "split" in the recursion such that the left branch



- (a) Recursion pattern of left nodes.
- (b) Recursion tree  $T_4$ .
- (c) Recursion pattern of right nodes.

Fig. 4: The recursion tree  $T_4$  (center). Right nodes are depicted black, left nodes gray, and leaves white. The recursive patterns applied at left and right nodes are shown on the left and right, respectively. At the root and its left child, we have that  $\bar{B} = B/2$ ; for other nodes,  $\bar{B}$  gets halved for each step further down the tree (where the leaves simply wire their single input to their single output). The left pattern comes in different variants. The gray wire with index  $\bar{B}+1$  is present only if B is odd; this never occurs in  $\mathrm{PPC}(C,T_b)$ , but becomes relevant when initially applying the left pattern exclusively for  $k \in \mathbb{N}$  steps (see Theorem 5.7), reducing the size of the resulting circuit at the expense of increasing its depth by k.

is "complete" (i.e. the number of inputs is a power of 2), while applying the same splitting strategy on the right. This is where our construction differs from and improves on [23]. They perform a balanced split and obtain an upper bound of 4B on the circuit size.

**Corollary 5.6.** For  $B \in \mathbb{N}$  and circuit C implementing  $\oplus$ , set  $b := \lceil \log B \rceil$ . Then a  $\operatorname{PPC}_{\oplus}(B)$  of depth  $\lceil \log B \rceil d(C)$  and size smaller than  $(5B - 2^b - F_{b+3})|C| \leq (4B - F_{b+3})$  exists.

*Proof.* If B is a power of 2, the claim follows from Lemmas 5.3, 5.4, and 5.5. In particular, for B=1 and B=2, respectively,  $\operatorname{PPC}_{\oplus}(C,T_0)$  and  $\operatorname{PPC}_{\oplus}(C,T_1)$  meet the requirements. For B>2 that is not a power of 2, set  $b\coloneqq\lceil\log B\rceil$  and perform the same construction as for  $\operatorname{PPC}(C,T_b)$ , but replace  $R_r$  at the root by the (recursively given)  $\operatorname{PPC}_{\oplus}(B-2^{b-1})$  circuit.

Correctness is immediate from the recursive construction and Lemma 5.3. Similarly, the depth bound follows from Lemma 5.4 and the recursive construction. Regarding size, we show by induction that  $s_B$ , the number of copies of C required for a circuit for B inputs, satisfies the claimed bound. This is already established for the base cases of B=1 and B=2. For B>3, we apply Lemma 5.1 to  $R_\ell$  in the root's circuit and Lemma 5.5 to its subcircuit  $R_c$ , while applying the induction hypothesis to the subcircuit  $R_T$  in the root's circuit. We get that

$$s_{B} < s_{2^{b-2}} + |\operatorname{PPC}_{\oplus}(C, T_{b-1})| + (B-1)$$

$$< (2^{b} - F_{b+3} + 1 + 4(B-2^{b-1}) + B - 1)$$

$$= (5B - 2^{b} - F_{b+3}).$$

We remark that one can give more precise bounds by making case distinctions regarding the right recursion, which for the sake of brevity we omit here. Instead, we computed the exact numbers for  $B \leq 70$ , see Figure 5.

The construction derived from iterative application of Lemma 5.1 can be combined with  $PPC(C,T_b)$ , achieving the following trade-off; note that if  $B=2^b$  for  $b\in\mathbb{N}$ , then  $F\lceil \log B\rceil -k+3$  can be replaced by  $F_{b-k+5}$ .

**Theorem 5.7** (improving on [23]). Suppose C implements  $\oplus$ . For all  $k \in [0, \lceil \log B \rceil]$  and  $B \in \mathbb{N}$ , there is a  $\operatorname{PPC}_{\oplus}(B)$  circuit of depth  $(\lceil \log B \rceil + k)d(C)$  and size at most

$$\left(\left(2+\frac{1}{2^{k-1}}\right)B-F_{\lceil \log B\rceil-k+3}\right)|C|.$$

*Proof.* For k steps, we apply Lemma 5.1, where in the final recursive step we use the circuit from Corollary 5.6. Correctness is immediate from Lemma 5.1 and Corollary 5.6.

Denote by  $B_i$ ,  $i \in [k+1]$ , the number of in- and outputs of the (sub)circuit at depth i of the recursion, i.e.,  $B_0 = B$  and  $B_{i+1} = \lceil B_i/2 \rceil$  for all  $i \in [k]$ . We have that  $B_i \leq B/2^i + \sum_{j=1}^i 2^{-j} < B/2^i + 1$ , which follows inductively via

$$B_{i+1} = \left\lceil \frac{B_i}{2} \right\rceil \le \left\lceil \frac{B_0/2^i + \sum_{j=1}^i 2^{-j}}{2} \right\rceil$$
$$\le \frac{B_0}{2^{i+1}} + \left( \sum_{j=1}^i 2^{-j-1} \right) + \frac{1}{2}$$
$$= \frac{B}{2^{i+1}} + \sum_{j=1}^{i+1} 2^{-j}.$$

Observe that  $\lceil \log B_{i+1} \rceil = \lceil \log B_i \rceil - 1$  for all  $i \in [k]$ . By Lemma 5.1 and Corollary 5.6, the size of the resulting circuit is thus bounded by

$$\left(\frac{4B}{2^k} - F_{\lceil \log B \rceil - k + 3} + \sum_{i=0}^{k-1} (B_i - 1)\right) |C|$$

$$< \left(\frac{4B}{2^k} - F_{\lceil \log B \rceil - k + 3} + \sum_{i=0}^{k-1} \frac{B}{2^i}\right) |C|$$

$$= \left(\left(\frac{2}{2^{k-1}} + 2 - \frac{1}{2^{k-1}}\right) B - F_{\lceil \log B \rceil - k + 3}\right) |C|$$

$$= \left(\left(2 + \frac{1}{2^{k-1}}\right) B - F_{\lceil \log B \rceil - k + 3}\right) |C|.$$

Finally, Lemmas 5.1 and 5.4 show that the circuit has depth

$$(2k + \lceil \log B_k \rceil)d(C) = (\lceil \log B \rceil + k)d(C)$$
.



Fig. 5: Comparison of the balanced recursion from [23] and ours. The curves for unbounded fan-out are the exact sizes obtained, whereas "upper bound" refers to the bound from Corollary 5.6; the fan-out 3 curves show that the unbalanced strategy performs better also for the construction from Theorem 5.17 (for f=3 and k=0) we derive next.

### 5.2 Constant Fan-out at Optimal Depth

The optimal depth construction incurs an excessively large fan-out of  $\Theta(B)$ , as the last output of left recursive calls needs to drive all the copies of C that combine it with each of the corresponding right call's outputs. This entails that, despite its lower depth, it will not result in circuits of smaller physical delay than simply recursively applying the construction from Figure 4a. Naturally, one can insert buffer trees to ensure a constant fan-out (and thus constantly bounded ratio between delay and depth), but this increases the depth to  $\Theta(\log^2 B + d(C)\log B)$ .

We now modify the recursive construction to ensure a constant fan-out, at the expense of a limited increase in size of the circuit. The result is the first construction that has size  $\mathcal{O}(B)$ , optimal depth, and constant fan-out.

In the following, we denote by  $f \geq 3$  the maximum fanout we are trying to achieve, where we assume that gates or memory cells providing the input to the circuit do not need to drive any other components. For simplicity, we consider C to be a single gate, i.e., a gate driving two C components has exactly fan-out 2.

We proceed in two steps. First, we insert 2B buffers into the circuit, ensuring that the fan-out is bounded by 2 everywhere except at the gate providing the last output of each subcircuit corresponding to a left node. In the second step, we will resolve this by duplicating these gates sufficiently often, recursively propagating the changes down the tree. Neither of these changes will affect the output (i.e. the correctness) of the circuit or its depth, so the main challenges are to show our claim on the fan-out and bounding the size of the final circuit.

#### 5.2.1 Step 1: Almost Bounding Fan-out by 2

Before proceeding to the construction in detail, we need some structural insight on the circuit.

**Definition 5.8.** For node  $v \in T_b$ , define its range  $R_v$  and left-count  $\alpha_v$  recursively as follows.

- If v is the root, then  $R_v = [1, 2^b]$  and  $\alpha_v = 0$ .
- If v is the left child of p with  $R_p = [i, i+j]$ , then  $R_v = [i, i+(j+1)/2]$  and  $\alpha_v = \alpha_p$ .
- If v is the right child of right node p with  $R_p = [i, i+j]$ , then  $R_v = [i+(j+1)/2+1, i+j]$  and  $\alpha_v = \alpha_p$ .

• If v is the right child of left node p, then  $R_v = R_p$  and  $\alpha_v = \alpha_p + 1$ .

Hence, the left-count  $\alpha_v$  tells us for every node  $v \in T_b$  the number of left recursion steps preceding v, whereas  $R_v$  gives us information about the range of inputs used at node v. We observe that each recursion halves the number of inputs and that the range is only cut in half if  $\alpha_v$  does not increase. Combining these observations with structural insights on the recursion patterns in Figures 4a and 4c, we state the following four properties of  $\operatorname{PPC}(C, T_b)$ .

**Lemma 5.9.** Suppose the subcircuit of  $PPC(C, T_b)$  represented by node  $v \in T_b$  in depth  $d \in [b+1]$  has range  $R_v = [i, i+j]$ . Then

- (i) it has  $2^{b-d}$  inputs,
- (ii)  $j = 2^{b-d+\alpha_v} 1$ ,
- (iii) if v is a right node, all its inputs are outputs of its childrens' subcircuits, and
- (iv) if v is a left node or leaf, only its even inputs are provided by its child (if it has one) and for odd  $k \in [1, 2^{b-d}]$ , we have that  $d_k^v = \bigoplus_{k'=i+(k-1)2^{\alpha_v}}^{i+k2^{\alpha_v}-1} d_{k'}$ .

*Proof.* Property (i) is immediate from the fact that with each step of recursion, the number of input and output wires is cut in half. Checking the above definition, we see that the range stays the same whenever  $\alpha_v$  increases, and otherwise is halved on each recursive step; Property (ii) follows. Property (iii) can be readily verified from Figure 4c.

The final property is shown by induction on b. It is immediate for b=0 and b=1. For  $b\geq 2$ , the subcircuit of the left child  $\ell$  of the root has  $2^{b-1}$  inputs, the odd ones of which are inputs to the overall circuit (cf. Figures 4a and 4c). As we have  $\alpha_v=0$ , we get that  $d_k^\ell=d_{i+k}=\bigoplus_{k'=i+(k-1)2^{\alpha_v}}^{i+k2^{\alpha_v}-1}d_{k'}$  and the node satisfies the claim. For the subcircuit  $R_r$  corresponding to the subtree rooted at the right child of the root, the claim holds by the induction hypothesis applied to b-1. For the subcircuit  $R_\ell$  of the left child, we see from Figure 4a that the subcircuit  $R_c$  corresponding to the subtree rooted at its child c receives inputs  $d_k^c=d_{2k-1}\oplus d_{2k}$ ,  $k\in[1,2^{b-1}]$ . Combining this with the induction hypothesis for b-2, the induction step is completed also in this case.  $\square$ 

Lemma 5.9 leads to an alternative representation of the circuit  $PPC(C,T_b)$ , see Figure 6, in which we separate gates in the recursive pattern from Figure 4a that occur before the subcircuit  $R_c$ . Adding the buffers we need in our construction, this results in the modified patterns given in Figure 6b. The separated gates appear at the bottom of Figure 6a: for each leaf v of  $T_b$ , there is a tree of depth  $\alpha_v$  aggregating all of the circuit's inputs from its range. Each non-root node in an aggregation tree provides its output to its parent. In addition, one of the two children of an inner node in the tree must provide its output as an input to one of the subcircuits corresponding to a node of  $T_b$ , cf. Property (iv) of Lemma 5.9.

From this representation, we will derive that the following modifications of  $\operatorname{PPC}(C,T_b)$  result in a  $\operatorname{PPC}_{\oplus}(2^b)$  circuit  $\operatorname{PPC}(C,T_b)'$ , for which a fan-out larger than 2 exclusively occurs on the last outputs of subcircuits corresponding to nodes of  $T_b$ .



- (a) Recursion tree  $T_4$  with separated aggregation trees and added buffers.
- (b) Recursive patterns  $R_r$  and  $R_\ell$ .

Fig. 6: Construction of  $PPC(C, T_4)'$ . On the left, we see the recursion tree, with the aggregation trees separated and shown at the bottom. Inputs are depicted as black triangles. On the right, the application of the recursive patterns at the children of the root is shown. Parts marked blue will be duplicated in the second step of the construction that achieves constant fan-out; this will also necessitate to duplicate some gates in the aggregation trees.

- 1) Add a buffer on each wire connecting a non-root node of any of the aggregation trees to its corresponding subcircuit (see Figure 6a).
- 2) For the subcircuit corresponding to left node  $\ell$  with range  $R_{\ell} = [i, i+j]$ , add for each even  $k \leq j$  (i.e., each even k but the maximum of j+1) a buffer before output  $\pi_k^{\ell}$  (see bottom of Figure 6b).
- 3) For each right node r with range [i, i+j], add a buffer before output  $\pi^r_{(i+1)/2}$  (see top of Figure 6b).

**Lemma 5.10.** With the exception of gates providing the last output of subcircuits corresponding to nodes of  $T_b$  (blue in Figure 6b), fan-out of  $PPC(C, T_b)'$  is 2. Buffers or gates driving an output of the circuit drive nothing else.

*Proof.* First, we prove the following invariant: If each input to a subcircuit of  $PPC(C, T_b)'$  corresponding to a node of  $T_b$  is driven by a gate or buffer driving no other wires, the same holds true for the outputs of the subcircuit. Suppose for contradiction that this invariant is violated and consider a minimal subcircuit doing so. There are three cases.

- The subcircuit corresponds to a leaf. This is a contradiction, as then the subcircuit simply is a wire connecting its sole input to its output.
- The subcircuit corresponds to a right node r (cf. top of Figure 6b) with range [i,i+j]. As the invariant applies to the subcircuit corresponding to its left child, its outputs  $\pi_1^r,\ldots,\pi_{(j-1)/2}^r$  satisfy the invariant. Its output  $\pi_{(j+1)/2}^r$  satisfies the invariant due to the inserted buffer. As the remaining outputs are driven by gates that drive nothing else, this case also leads to a contradiction.
- The subcircuit corresponds to a left node  $\ell$  (cf. bottom of Figure 6b) with range [i,i+j]. As  $d_1^\ell$  is simply wired to  $\pi_1^\ell$  (and nothing else), it satisfies the invariant. The last output  $\pi_{j+1}^\ell$  satisfies the invariant, because the recursively used subcircuit does. The remaining outputs are driven by gates or buffers driving nothing else, again resulting in a contradiction.

As all cases result in a contradiction, the invariant holds.

Next, observe that, by construction, the aggregation trees have fan-out 2 after buffer insertion. Each buffer or gate from this part of the circuit drives exactly one wire connecting it to the remainder of the circuit. Thus, the above invariant shows that all subcircuits corresponding to nodes of  $T_b$  satisfy that each of their outputs is driven by gate or a buffer driving nothing else. Checking Figure 6b, we can thus conclude that indeed no gate or buffer drives more than two others, unless it provides the last output to one of the recursively used subcircuits in the construction at a right node; gates or buffers driving an output of  $\operatorname{PPC}(C, T_b)'$  drive only this output.

It remains to count the inserted buffers. We do so by computing a closed form expression from the linear recurrence that describes the number of nodes of a given type (left, right, leaf) in a given depth as function of the previous one. The following helper statement will be useful for this, but also later on.

**Lemma 5.11.** Denote by  $L_b \subseteq T_b$  the set of leaves of  $T_b$ . Then  $|L_b| = F_{b+2}$  and  $\sum_{v \in L_b} 2^{\alpha_v} = 2^b$ .

*Proof.* We have that  $|L_0|=1=F_2$ ,  $|L_1|=2=F_3$ , and, by Observation 5.2, for  $b\geq 2$  that  $|L_b|=|L_{b-1}|+|L_{b-2}|$ . This recurrence has solution  $|L_b|=F_{b+2}$ .

Next, consider the recurrence given by  $L'_0=1$ ,  $L'_1=2$ , and  $L'_b=L'_{b-1}+2L'_{b-2}$  for  $b\geq 2$ ; the factor of 2 assigns twice the weight to the subtree rooted at the child of the root's left child, thereby ensuring that each leaf is accounted with weight  $2^{\alpha_v}$ . This recurrence has solution  $2^b$ .

**Lemma 5.12.** Denote by s the size of a buffer. Then

$$|\operatorname{PPC}(C, T_b)'| = |\operatorname{PPC}(C, T_b)| + (2^b + 2^{b-1} - F_{b+3}) s.$$

*Proof.* To count the number c(b) of buffers in subcircuits corresponding to nodes of  $T_b$ , we observe that c(0) = 0, c(1) = 1, and for  $b \ge 2$  it holds that  $c(b) = 2^{b-2} + c(b-1) + c(b-2)$ : 1 buffer at the root (see top of Figure 6b),  $2^{b-2} - 1$  buffers at its left child (see bottom of Figure 6b), c(b-1) buffers for the subtree rooted at its right child, and c(b-2) buffers for the

subtree rooted at the child of its left child. This recurrence relation has the solution  $c(b)=2^b-F_{b+1}$ .

To count the number of buffers attached to the aggregation trees, recall that from depth  $d \neq 0$  of each tree, exactly half of the nodes' output is required by a buffer connected to some node in  $T_b$  (this follows from Lemma 5.9 and the fact that ranges of nodes in the same depth of  $T_b$  form a partition of  $[1,2^b]$ ). Thus, this number equals  $\sum_{v\in L_b} 2^{\alpha_v-1} - 1$ . By Lemma 5.11, the total number of buffers thus equals

$$2^{b} - F_{b+1} + 2^{b-1} - F_{b+2} = 2^{b} + 2^{b-1} - F_{b+3}$$
.  $\square$ 

Similar arguments serve later as well. The main reason why we will define the function a(v) in the next section without rounding is to ensure that we again obtain linear recurrences, which can be solved using standard techniques from linear algebra. As a downside, this results in slightly overestimating the size of circuits, as we may ask for more copies of gates from children than are actually needed.

# 5.2.2 Step 2: Bounding Fan-out by f

In the second step, we need to resolve the issue of high fanout of the last output of each recursively used subcircuit in  $PPC(C, T_b)'$ . Our approach is straightforward. Starting at the root of  $T_b$  and progressing downwards, we label each node v with a value a(v) that specifies a sufficient number of additional copies of the last output of the subcircuit represented by v to avoid fan-out larger than f. At right nodes, this is achieved by duplicating the gate computing this output sufficiently often, marked blue in Figure 6b (top). For left nodes, we simply require the same number of duplicates to be provided by the subcircuit represented by their child (i.e., we duplicate the blue wire in the bottom recursive pattern shown in Figure 6b). Finally, for leaves, we will require a sufficient number of duplicates of the root of their aggregation tree; this, in turn, may require to make duplicates of their descendants in the aggregation tree.

We define a(v) and then utilize it to describe our fan-out f circuit. Afterwards, we will analyze the increase in size of the circuit compared to  $PPC(C, T_b)'$ .

**Definition 5.13** (a(v)). Fix  $b \in \mathbb{N}_0$ . For  $v \in T_b$  in depth  $d \in [b+1]$ , define

$$a(v) \coloneqq \begin{cases} 0 & \text{if } v \text{ is the root} \\ \frac{a(p) + 2^{b-d}}{f} & \text{if } v \text{ is the left child of } p \\ \frac{a(p)}{f} & \text{if } v \text{ is the right child of right node } p \\ a(p) & \text{if } v \text{ is the (only) child of left node } p. \end{cases}$$

**Lemma 5.14.** Suppose that for each leaf  $v \in T_b$ , there are  $\lfloor a(v) \rfloor$  additional copies of the root of the aggregation tree, and for each right node  $v \in T_b$ , we add  $\lfloor a(v) \rfloor$  gates that compute (copies of) the last output of their corresponding subcircuit of  $PPC(C, T_b)'$ . Then we can wire the circuit such that all gates that are not in aggregation trees have fan-out at most f, and each output of the circuit is driven by a gate or buffer driving only this output.

*Proof.* We prove the claim by induction on the depth of nodes, starting at the leaves. The base case holds by assumption, as each leaf is provided with sufficiently many copies of its (single) input. For the induction step from depth d+1 to  $d \in [b]$ , fix some  $v \in T_b$  in depth d. By Lemma 5.10,

we only need to consider the last output of the subcircuit corresponding to the node; there are in total  $1+\lfloor a(v)\rfloor$  gates providing it. We distinguish four cases.

• v is a left node. Thus, its child c is a right node with  $1 + \lfloor a(c) \rfloor = 1 + \lfloor a(v) \rfloor$  gates providing its last output. As the parent p of v is a right node, these need to drive  $1 + 2^{b-d} + \lfloor a(p) \rfloor$  gates (cf. Figure 6b). We have that

$$f(1 + \lfloor a(v) \rfloor) = f\left(1 + \left\lfloor \frac{a(p) + 2^{b-d}}{f} \right\rfloor\right)$$
  
 
$$\geq f + a(p) + 2^{b-d} - (f-1)$$
  
 
$$\geq 1 + 2^{b-d} + \lfloor a(p) \rfloor.$$

- v is a right node with left parent p. This was already covered in the previous case from the viewpoint of p.
- v is a right node with right parent p. As p is also a right node, we need to drive  $1 + \lfloor a(p) \rfloor$  gates (cf. Figure 6b). We have that

$$f(1 + \lfloor a(v) \rfloor) = f\left(1 + \left\lfloor \frac{a(p)}{f} \right\rfloor\right)$$
  
 
$$\geq f + a(p) - (f - 1) \geq 1 + \lfloor a(p) \rfloor.$$

• *v* is the root. Thus, it provides the outputs to the circuit, and the claim is immediate from Lemma 5.10. □

It remains to modify the aggregation trees so that sufficiently many copies of the roots' output values are available.

**Lemma 5.15.** Consider an aggregation tree corresponding to leaf  $v \in T_b$  and fix  $f \geq 3$ . We can modify it such that the fanout of all its non-root nodes becomes at most f, there are  $\lfloor a(v) \rfloor$  additional gates computing the same output as the root, and at most  $(fa(v))/(f-2) + (2^{\alpha_v-1})/(f-1)$  gates are added.

*Proof.* We recursively assign a value  $a_d$  to each depth  $d \in [\alpha_v + 1]$  of the aggregation tree that bounds the number of additional gates in this depth from above. Accordingly,  $a_0 \coloneqq a(v)$ . To determine a suitable recurrence, recall that for each node in the tree, one child needs to also drive a buffer, while the other does not (Lemma 5.9). Fix one child c in depth d and its parent p, and denote by a(c) and a(p) the number of additional gates required. If c also needs to drive a buffer, it is sufficient that  $a(c) \ge \lfloor (a(p)+1)/f \rfloor$ , as  $f(1+\lfloor (a(p)+1)/f \rfloor) \ge 2+a(p)$ ; similarly,  $a(c) \ge \lfloor a(p)/f \rfloor$  is sufficient, if the child does not need to drive an additional buffer. As aggregation trees are complete balanced binary trees, summing over all nodes in depth d we thus get that  $a_{d+1} = (2a_d + 2^d)/f$  for all  $d \in [\alpha_v]$  is sufficient. This recurrence has solution

$$a_d = \left(\frac{2}{f}\right)^d a(v) + \frac{2^{d-1}}{f-1} \left(1 - \frac{1}{f^d}\right).$$

The total number of additional gates up to depth  $\alpha_v-1$  is thus bounded by

$$\begin{split} \sum_{d=0}^{\alpha_v-1} a_d &= \sum_{d=0}^{\alpha_v-1} \left(\frac{2}{f}\right)^d a(v) + \frac{2^{d-1}}{f-1} \left(1 - \frac{1}{f^d}\right) \\ &< \frac{fa(v)}{f-2} + \frac{2^{\alpha_v-1}}{f-1} \,, \end{split}$$

matching the claimed bound.



Fig. 7:  $PPC^{(3)}(C, T_4)$ . Right recursion steps  $R_r$  are marked with dark gray, left recursion steps with light gray. The step at the root (above) and aggregation trees (below) are not marked explicitly. Duplicated gates are depicted in a layered fashion. Dashed lines indicate that a wire is not participating in a recursive step.

It remains to show that no gates need to be provided at the leaves of the aggregation tree; beside showing the claimed bound on the increase in size of the circuit, this is also necessary, because we cannot make copies of inputs without increasing the depth of the circuit. As  $f \geq 3$  and nodes in aggregation trees have fan-out 2 in  $\operatorname{PPC}(C, T_b)'$ , we can at least double the number of copies of each node with each level of the aggregation tree. As the tree at leaf v has depth  $\alpha_v$ , it is hence sufficient to show that  $a(v) \leq 2^{\alpha_v}$ .

To bound a(v) from above, we again exploit the recursive structure of the construction. Denote by A(b,x) an upper bound on  $a(v)/2^{\alpha_v}$  for all leaves  $v \in T_b$  when defining the values a(v) as usual, except that we set a(r) = x for some  $0 \le x \le 2^{b-1}$  at the root  $r \in T_b$ . For  $b \ge 2$ , we get that

$$A(b,x) \le \max \left\{ A\left(b-1, \frac{x}{f}\right), A\left(b-2, \frac{2^{b-1}+x}{2f}\right) \right\} < \max \left\{ A\left(b-1, 2^{b-2}\right), A\left(b-2, 2^{b-2}\right) \right\}.$$

Moreover, we have that  $A(0,x)=x\leq 1/2$  and  $A(1,x)=(x+1)/f\leq 2/3$  for feasible values of x, where we use that  $f\geq 3$ . Hence,  $A(b,x)\leq 2/3$  for all b and feasible values of x. In particular,  $A(b,0)\leq 2/3$ , implying that indeed  $a(v)<2^{\alpha_v}$  for all leaves  $v\in T_b$ .

Finally, we need to count the total number of gates we add when implementing these modifications to the circuit.

**Lemma 5.16.** For  $f \geq 3$ , define  $PPC^{(f)}(C, T_b)$  by modifying  $PPC(C, T_b)'$  according to Lemmas 5.14 and 5.15. Then, with  $\lambda_1 := (1 + \sqrt{5})/4$ ,  $|PPC^{(f)}(C, T_b)|$  is bounded by

$$|\operatorname{PPC}(C, T_b)'| + 2^b \left(\frac{1}{2f - 2} + \frac{2}{f - 2} + \mathcal{O}\left(\frac{\lambda_1^b}{f^2}\right)\right) |C|.$$

*Proof.* Denote by  $R_b \subset T_b$  the set of right nodes in  $T_b$ . By Lemma 5.14, we add at most  $\sum_{v \in R_b} a(v)$  gates to the part

of the circuit corresponding to  $T_b$ . Recall that  $L_b \subseteq T_b$  is the set of leaves of  $T_b$ . By Lemmas 5.11 and 5.15, we add at most

$$\sum_{v \in L_b} \frac{fa(v)}{f-2} + \frac{2^{\alpha_v-1}}{f-1} = \frac{2^{b-1}}{f-1} + \sum_{v \in L_b} \frac{fa(v)}{f-2}$$

gates to the aggregation trees.

To bound these numbers, denote for  $d \in [b]$  by  $x_d$  and  $y_d$  the sum over right and left nodes' a(v) values in depth d, respectively. Moreover, let  $l_d$  and  $r_d$  denote  $2^{b-d}$  times the number of left and right nodes in depth d, respectively. Thus, we seek to bound  $f(x_b + y_b)/(f-2) + \sum_{d=0}^{b-1} x_d$ . We have that  $x_0 = y_0 = 0$ ,  $l_1 = r_1 = 2^{b-1}$ , and

$$\begin{pmatrix} x_d \\ y_d \\ l_{d+1} \\ r_{d+1} \end{pmatrix} = \begin{pmatrix} f^{-1} & 1 & 0 & 0 \\ f^{-1} & 0 & f^{-1} & 0 \\ 0 & 0 & 0 & 1/2 \\ 0 & 0 & 1/2 & 1/2 \end{pmatrix}^d \begin{pmatrix} x_0 \\ y_0 \\ l_1 \\ r_1 \end{pmatrix}$$

for all  $d \in [b-1]$ . The recurrence matrix has the eigenvalues

$$\lambda_1 = \frac{1}{4}(1+\sqrt{5}), \quad \lambda_3 = \frac{1}{2f}(1+\sqrt{1+4f}),$$

$$\lambda_2 = \frac{1}{4}(1-\sqrt{5}), \quad \lambda_4 = \frac{1}{2f}(1-\sqrt{1+4f}).$$

The recurrence has solution  $r_d = 2^{b-d}F_{d+1}$ ,  $l_d = 2^{b-d}F_d$ ,

$$x_{d} = \frac{2^{b+1}}{f^{2} - 10f + 20} \cdot \left(\frac{f^{2} - 3f - 2}{\sqrt{1 + 4f}} \left(-\lambda_{3}^{d} + \lambda_{4}^{d}\right) - (f - 2)\left(\lambda_{1}^{d} + \lambda_{2}^{d} - \lambda_{3}^{d} - \lambda_{4}^{d}\right) + \frac{3f - 10}{\sqrt{5}}\left(\lambda_{1}^{d} - \lambda_{2}^{d}\right)\right) \in \mathcal{O}\left(\frac{2^{b}\lambda_{1}^{d}}{f}\right),$$

and  $y_d = x_{d+1} - x_d/f$ , where the asymptotic bound on  $x_d$  uses that for  $f \ge 3$  and all  $i \in [1, 4]$ ,  $\lambda_1 \ge |\lambda_i|$ . Therefore,

$$\frac{f(x_b + y_b)}{f - 2} + \sum_{d=0}^{b-1} x_d \in \mathcal{O}\left(\frac{x_b + x_{b+1}}{f}\right) + \sum_{d=0}^{\infty} x_d$$
$$= \mathcal{O}\left(\frac{\lambda_1^d x_{d+1}}{f^2}\right) + \sum_{d=0}^{\infty} x_d,$$

Observe that for  $f \geq 3$ ,  $0 \neq |\lambda_i| < 1$  for all i; thus,  $\sum_{d=0}^{\infty} \lambda_i^d = 1/(1-\lambda_i)$ , yielding with some calculation that  $\sum_{d=0}^{\infty} x_d = 2^{b+1}/(f-2)$ . Summation of the individual terms and multiplication by |C| proves claim of the theorem.  $\square$ 

As an example for the overall resulting construction, we show  $PPC^{(3)}(C, T_4)$  in Figure 7. We summarize our findings in the following theorem.

**Theorem 5.17.** Suppose that C implements  $\oplus$ , buffers have size s and depth at most d(C), and set  $\lambda_1 := (1 + \sqrt{5})/4$ . Then for all  $k \in [b+1]$ ,  $b \in \mathbb{N}_0$ , and  $f \geq 3$ , there is a  $\operatorname{PPC}_{\oplus}(2^b)$  circuit of fan-out f, depth (b+k)d(C), and size at most

$$\left(2^{b+1} + 2^{b-k} \left(2 + \frac{5f - 6}{2f^2 - 6f + 4} + \mathcal{O}\left(\frac{\lambda_1^b}{f^2}\right)\right)\right) |C| + \left(2^b + 2^{b-k-1}\right) s.$$

*Proof.* We argue as for Theorem 5.7, but replace  $PPC(C, T_b)$  by  $PPC^{(f)}(C, T_b)$ , and need to make sure that we modify the first k steps of the recursion, where we apply the construction from Figure 4a, such that the fanout is at most f. In fact, we will ensure a fanout of 2 for this part of the construction. To this end, we simply add a buffer before each output that is not directly driven by a copy of C, as already indicated in the figure. This guarantees the invariant that all inputs to and outputs of subcircuits are driven by an element not driving anything else; for the  $PPC^{(f)}(C, T_b)$  subcircuit, this invariant holds by Lemma 5.10.

This adds in total  $2^b - 2^{b-k}$  buffers to the circuit: one for each output wire minus one for each output wire of  $PPC^{(f)}(C,T_b)$ . Thus, by Theorem 5.7, the size of the circuit is bounded by  $\Delta + (2^b - 2^{b-k}) s + (2^{b+1} + 2^{b-k+1}) |C|$ , where  $\Delta := |PPC^{(f)}(C,T_b)| - |PPC(C,T_b)|$ . By Lemmas 5.12 and 5.16,  $\Delta$  is bounded by

$$2^{b-k} \left( \frac{1}{2f-2} + \frac{2}{f-2} + \mathcal{O}\left(\frac{\lambda_1^{b-k}}{f^2}\right) \right) |C| + \frac{3 \cdot 2^{b-k} s}{2} \,.$$

Summation yields the claimed bound on the size of the circuit. The depth bound and that we indeed get a  $PPC_{\oplus}(2^b)$  circuit follow as in Theorem 5.7, as the modifications to the construction affected neither its depth nor its output.

We refrain from analyzing the size of the construction for values of B that are not powers of 2. However, in Figure 8 we plot the exact bounds (without buffers) for k=0 and selected values of f against B.

# 6 SIMULATION

Separate from and in addition to the proofs from the previous sections, we verify the correctness of our circuits by VHDL simulation. To this end, we first need to specify implementations of the subcircuits computing  $\diamond_M$  and  $\operatorname{out}_M$ .



Fig. 8: Dependence of the size of the modified construction on f. For comparison, the upper bound from Corollary 5.6 on the circuit with unbounded fan-out is shown as well.

# 6.1 Gate-Level Implementation of Operators

From Tables 6a and 6b, for  $s,b\in\mathbb{B}^2$  we can extract the Boolean formulas

$$(s \diamond b)_1 = s_1 \bar{s}_2 + s_1 \bar{b}_1 + \bar{s}_2 b_1$$
  

$$(s \diamond b)_2 = \bar{s}_1 s_2 + \bar{s}_1 b_2 + s_2 \bar{b}_2$$
  

$$\text{out}(s, b)_1 = \bar{s}_1 b_2 + \bar{s}_2 b_1 + b_1 b_2$$
  

$$\text{out}(s, b)_2 = s_1 b_2 + s_2 b_1 + b_1 b_2.$$

In general, realizing a Boolean formula f by replacing negation, multiplication, and addition by inverters, AND, and OR gates, respectively, does not result in a circuit implementing  $f_{\rm M}$ . However, we can easily verify that the above formulas are disjunctions of all prime implicants of their respective functions. As shown in [10] (see also [16]), in this special case the resulting circuits do implement the closure—provided the gates behave as in Table 3, which the implementations given in Figure 1 do by Theorem 3.14. Using distributive laws (recall that these also hold in Kleene logic), the above formulas can be rewritten as

$$(s \diamond b)_1 = s_1(\bar{s}_2 + \bar{b}_1) + \bar{s}_2b_1$$
  

$$(s \diamond b)_2 = s_2(\bar{s}_1 + \bar{b}_2) + \bar{s}_1b_2$$
  

$$out(s, b)_1 = b_1(b_2 + \bar{s}_2) + b_2\bar{s}_1$$
  

$$out(s, b)_2 = b_2(b_1 + s_1) + b_1s_2.$$

We see that, in fact, a single circuit with suitably wired (and possibly negated) inputs can implement all four operations. As for  $sel_1 = \overline{sel}_2$  the circuit implements a multiplexer with select bit  $sel_1$ , we refer to it as *extended multiplexer*, or XMUX for short. Its functionality is specified by

$$XMUX(sel_1, sel_2, x, y) := y(x + sel_2) + x sel_1$$
.

Figure 9 shows the resulting circuit, and Table 11 lists how to map inputs to compute  $\diamond_{\mathbb{M}}$  and  $\operatorname{out}_{\mathbb{M}}$ .

We note that this circuit is not a particularly efficient XMUX implementation; a transistor-level implementation would be much smaller. However, our goal here is to verify correctness and give some initial indication of the size of the resulting circuits—a fully optimized ASIC circuit is beyond

- 1. For instance,  $(s \diamond b)_1 = s_1 \bar{b}_1 + \bar{s}_2 b_1$  as Boolean formula, but the two expressions differ when evaluated on  $s_1 = \bar{s}_2 = 1$  and  $b_1 = \mathbb{M}$ . The circuits resulting from the different formulas are implementations of a multiplexer (with select bit  $b_1$ ) and its closure, respectively.
- 2. Alternatively, one can manually verify that these formulas evaluate to the truth tables given in Tables 8 and 9.



Fig. 9: XMUX circuit, used to implement  $\diamond_{M}$  and  $\operatorname{out}_{M}$ .

TABLE 11: Wiring an XMUX to compute the various operators.

| $sel_1$     | $sel_2$     | x           | y     | $XMUX(sel_1,sel_2,a,b)$                             |
|-------------|-------------|-------------|-------|-----------------------------------------------------|
| $b_1$       | $\bar{b}_1$ | $\bar{s}_2$ | $s_1$ | $(s \diamond_{\scriptscriptstyle{\mathbb{M}}} b)_1$ |
| $b_2$       | $\bar{b}_2$ | $\bar{s}_1$ | $s_2$ | $(s \diamond_{\scriptscriptstyle{\mathbb{M}}} b)_2$ |
| $\bar{s}_1$ | $\bar{s}_2$ | $b_2$       | $b_1$ | $\operatorname{out}_{\mathtt{M}}(s,b)_1$            |
| $s_2$       | $s_1$       | $b_1$       | $b_2$ | $\operatorname{out}_{\mathtt{M}}(s,b)_2$            |

the scope of this article. In [4], the size of the implementation is slightly reduced by moving negations. Due to space limitations, we refrain from detailing this modification here, but note that Figure 12 and Table 12 take it into account.

# 6.2 Putting it All Together

We now have all the pieces in place to assemble a containing 2-sort(B) circuit. By Theorem 4.3,  $\diamond_{\mathbb{M}}$  is associative. Thus, from a given implementation of  $\diamond_{\mathbb{M}}$  (e.g., two copies of the circuit from Figure 9 with appropriate wiring and negation, cf. Table 11) we can construct  $\operatorname{PPC}_{\diamond_{\mathbb{M}}}(B-1)$  circuits of small depth and size, as shown in Section 5. We can combine such a circuit with an  $\operatorname{out}_{\mathbb{M}}$  implementation (again, two XMUXes with appropriate wiring and negation will do) as shown in Figure 10 to obtain our 2-sort(B) circuit.

#### 6.3 Simulation Setup

We implemented the design given in Figure 10 on register-transfer-level using the  $PPC_{\diamond_{\mathsf{M}}}(B-1)$  circuit given by Theorem 5.7 for  $k=0.^3$  Quartus by Altera is used for design entry, which in our case mainly consists of checking correct implementation. After design entry we use ModelSim by Altera for behavioral simulation. Note that we must not simulate the preprocessed Quartus output, because processing may compromise metastability-containing behavior. Instead, we simulate pure VHDL. Metastable signals are simulated using VHDL signal X, because its behavior matches the worst-case behavior assumed for M.

The correctness of this construction follows from Theorems 4.7 and 4.8, where we can plug in any  $PPC_{\diamond_{\aleph}}(B-1)$  circuit, cf. Section 5. For the circuits derived by relying on the XMUX circuit from Figure 9, we independently confirmed this via simulation.

#### 6.4 Results

For the implementation of  $PPC_{\diamond_{\mathbb{M}}}(B-1)$  we used the circuits from Theorem 5.7, i.e., we did not make use of the

3. For k>0, fan-out becomes an issue, requiring the more involved constructions provided by Theorem 5.17. However, the resulting numbers would be inaccurate, and a detailed comparison based on optimized ASIC implementations is beyond the scope of this work.



Fig. 10: Constructing 2-sort(B) from  $PPC_{\diamond_{\mathbb{M}}}(B-1)$  and  $out_{\mathbb{M}}$ .

| 0X10 | 01X1                 | 1011                                |
|------|----------------------|-------------------------------------|
| 0011 | 010X                 | 00X1                                |
| 0X10 | 01XX                 | 1011                                |
| 0011 | 01XX                 | 00X1                                |
| 0X10 | 010X                 | 1011                                |
| 0011 | 01X1                 | 00X1                                |
|      | 0011<br>0X10<br>0011 | 0011 010X<br>0X10 01XX<br>0011 01XX |

Fig. 11: Excerpt from a simulation for 4-bit inputs, where X=M. The rows show (from top to bottom) the inputs g and h, both outputs of the simple non-containing circuit, and both outputs of our design. As inputs g and h we randomly generated valid strings. Columns 1 and 3 show that the simpler design fails to implement a 2-sort(4) circuit.

extension to constant fan-out. We exhaustively checked the design from Figure 10 for B up to 12 (and all feasible k). Simulation shows that the design works correctly for several levels of recursion, e.g., when regarding B=1 and B=2 as simple base cases, B=12 implies 3 levels of recursion for both patterns. We refrained from simulating the constant fan-out construction, because it simply replicates intermediate results without changing functionality.

# 6.5 Comparison to Baseline

After behavioral simulation, we continue with a comparison of our design and a standard sorting approach  $\operatorname{Bin-comp}(B)$ . As mentioned earlier, the  $\operatorname{2-sort}(B)$  implementation given in Figure 10 is slightly optimized by pulling out a negation from the operators in every recursive step [4].

After design entry as described above, we use *Encounter RTL Compiler* for synthesis and *Encounter* for place and route. Both tools are part of the Cadence tool set and in both steps we use NanGate 45 nm Open Cell Library as a standard cell library.

Since metastability-containing circuits may include additional gates that are not required in traditional Boolean logic, Boolean optimization may compromise metastability-containing properties [3]. Accordingly, we were forced to disable optimization during synthesis of the circuits.



Fig. 12: Comparison of our solution PPC Sort to a standard non-containing one. For the latter, the unexpected delay reduction at B=16 is the result of automatic optimization with more powerful gates, which our solution does not use.

Binary Benchmark Bin-comp: In short, Bin-comp consists of a simple VHDL statement comparing two binary encoded inputs and outputting the maximum and the minimum, accordingly. It follows the same design process as 2-sort, but then undergoes optimization using a more powerful set of basic gates. For example, the standard cell library provides prebuild multiplexers. These multiplexers are used by Bin-comp, but not by 2-sort, as they are not metastability-containing. We stress that these more powerful gates provide optimized implementations of multiple Boolean functions, yet each of them is still counted as a single gate. Thus, comparing our design to the binary design in terms of gate count, area, and delay disfavors our solution. Moreover, we noticed that the optimization routine switches to employing more powerful gates when going from B=8 to B=16 (cf. Figure 12), resulting in a *decrease* of the delay of the Bin-comp implementation.

Nonetheless, our design performs comparably to the non-containing binary design in terms of delay, cf. Figure 12 and Table 12. This is quite notable, as further optimization is possible by optimizing our design on the transistor level, with significant expected gains. The same applies to gate count and area, where a notable gap remains. Recall, however, that the Bin-comp design hides complexity by using more advanced gates and does not contain metastability.

We emphasize that we refrained from optimizing the design by making use of all available gates or devising transistor-level implementations, as such an approach is tied to the utilized library or requires design of standard cells.

# 7 Conclusions

In this work, we demonstrated that efficient metastability-containing sorting circuits are possible. Our results indicate that optimized implementations can achieve the same delay as non-containing solutions, without a dramatic increase in circuit size. This is of high interest to an intended application motivating us to design MC sorting circuits: fault-tolerant high-frequency clock synchronization. Sorting is a key step in envisioned implementations (cf. [10], [15]) of the Lynch-Welch algorithm [30] with improved precision of synchronization. The complete elimination of synchronizer delay is possible due to the efficient MC sorting networks presented in this article; enabling an increment of the rate at which clock corrections are applied, significantly reducing

the negative impact of phase drift of local clock sources on the precision of the algorithm (cf. [18]).

This goal will necessitate to devise optimized ASIC implementations of our circuits. The novel PPC circuits we devised in Section 5 are an important contribution towards this end. Note that it is crucial to take into account both depth and fan-out for devising low-delay circuits. Hence, follow-up work needs to compare the existing and our novel design based on suitable metrics that take both into account to reliably predict the achieved trade-offs between delay, area, and energy consumption of circuits. Note that this is of relevance beyond the specific application of MC sorting: PPC circuits lie at the heart of adder designs, implying that even a minor improvement can have significant impact on the overall performance of computing devices!

MC Control Loops: More generally speaking, MC circuits like those presented here are of interest in mixed-signal control loops whose performance depends on very short response times. When analog control is not desirable, traditional solutions incur synchronizer delay before being able to react to any input change. Using MC logic saves the time for synchronization, while metastability of the output corresponds to the initial uncertainty of the measurement; thus, the same quality of the computational result can be achieved in shorter time. Note that our circuits are purely combinational, so they can be used in both clocked and asynchronous control logic.

Obvious examples of such control loops are clock synchronization circuits, but MC has been shown to be useful for adaptive voltage control [13] and fast routing with an acceptable low probability of data corruption [29] as well. This type of application suggests to explore whether efficient circuits exist for a wider range of arithmetic operations, like e.g. addition or (possibly approximate) multiplication.

Redundant Encoding and Addition: On the theoretical side, our results are to be contrasted with the exponential gap between the size of non-containing and MC circuits shown in [17]. This work raised the question for which classes of functions small MC circuits exist. Given that Ladner and Fischer proved that the PPC task can be solved efficiently for any constant-sized state machine [23], it was natural to ask whether this result can be extended to MC computations. In follow-up work, we show that indeed this holds true for any constant-sized FSM [5]. However, when applying this result to addition, unlike for sorting (where the underlying operations are max and min) uncertainty of inputs adds up. This means that Gray code can support meaningful computations only if the *total* uncertainty of all addends is at most 1.

Accordingly, in [5] we also consider redundant encodings, showing that using k (roughly) redundant bits, an uncertainty of  $\lfloor (k+1)/2 \rfloor$  can be tolerated without loss of precision. Combined with the above result on transducers, this yields a meaningful notion of MC addition that allows for efficient circuits. As, essentially, the redundant bits are used as a unary code, it should be straightforward to apply the techniques from this article to obtain efficient sorting circuits with the encoding from [5]. We remark that the encoding from [5] turns out to be identical to that of the output of suitable time-to-digital converters [12], so relaxing their output constraints to achieve better average-case

TABLE 12: Simulation results for metastability-containing sorting networks with  $n \in \{4,7,10\}$  for B-bit inputs.  $10\text{-sort}_{\#}$ optimizes gate count [7], 10-sort<sub>d</sub> optimizes depth [6]; for  $n \in \{4,7\}$ , the sorting networks are optimal w.r.t. both measures. Simulation results are: (i) number of gates, (ii) postlayout area  $[\mu m^2]$  and (iii) prelayout delay [ps].

| В  | Circuit  | 4-sort |          |       | 7-sort |          | 10-sort <sub>#</sub> |       |          | 10-sort <sub>d</sub> |       |           |       |
|----|----------|--------|----------|-------|--------|----------|----------------------|-------|----------|----------------------|-------|-----------|-------|
|    |          | gates  | area     | delay | gates  | area     | delay                | gates | area     | delay                | gates | area      | delay |
|    | our work | 65     | 87.402   | 357   | 208    | 279.741  | 714                  | 377   | 506.912  | 912                  | 403   | 541.968   | 833   |
| 2  | Bin-comp | 40     | 77.91    | 478   | 128    | 249.326  | 953                  | 232   | 451.815  | 1284                 | 248   | 483       | 1145  |
|    | our work | 275    | 368.641  | 640   | 880    | 1179.528 | 1014                 | 1595  | 2137.905 | 1235                 | 1705  | 2285.514  | 1133  |
| 4  | Bin-comp | 95     | 172.935  | 906   | 304    | 553.28   | 1810                 | 551   | 1002.848 | 2429                 | 589   | 1072.099  | 2143  |
|    | our work | 845    | 1136.184 | 1396  | 2704   | 3636.08  | 1921                 | 4901  | 6590.283 | 2179                 | 5239  | 7044.541  | 2059  |
| 8  | Bin-comp | 205    | 368.641  | 1475  | 656    | 1179.528 | 2948                 | 1189  | 2137.905 | 3945                 | 1271  | 2285.514  | 3470  |
|    | our work | 2035   | 2739.961 | 2069  | 6512   | 8767.374 | 3396                 | 11803 | 15891.12 | 4030                 | 12617 | 16987.194 | 3844  |
| 16 | Bin-comp | 405    | 530.67   | 1298  | 1296   | 2425.99  | 2600                 | 2349  | 4397.085 | 3474                 | 2511  | 4700.304  | 3050  |

performance would provide valid input for sorting circuits that accept inputs encoded in this manner.

We believe that these results suggest applicability of our techniques to a wide range of mixed-signal control loops and call for future work further exploring to which extend basic arithmetics can be realized by efficient MC circuits.

Acknowledgments: We thank Attila Kinali and the anonymous reviewers for valuable input. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement 716562).

#### REFERENCES

- M. Ajtai, J. Komlós, and E. Szemerédi. An  $\mathcal{O}(n \log n)$  Sorting Network. In STOC, 1983.
- R. P. Brent and H. T. Kung. A Regular Layout for Parallel Adders. Transactions on Computers, C-31(3):260-264, 1982.
- Johannes Bund, Christoph Lenzen, and Moti Medina. Near-Optimal Metastability-Containing Sorting Networks. In DATE,
- Johannes Bund, Christoph Lenzen, and Moti Medina. Optimal Metastability-Containing Sorting Networks. In DATE, 2018.
- Johannes Bund, Christoph Lenzen, and Moti Medina. Hazard-free Transducers. CoRR, abs/1811.12369, 2018.
- Daniel Bundala and Jakub Závodný. Optimal sorting networks.
- In LATA, pages 236–247. Springer, 2014. Michael Codish, Luís Cruz-Filipe, Michael Frank, and Peter Schneider-Kamp. 25 comparators is optimal when sorting 9 inputs (and 29 for 10). In ICTAI, 2014.
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, 3rd Edition. MIT Press, 2009.
- Stephan Friedrichs. Metastability-Containing Circuits, Parallel Distance Problems, and Terrain Guarding. PhD thesis, Saarland University, Saarbrücken, Germany, 2017.
- [10] Stephan Friedrichs, Matthias Függer, and Christoph Lenzen. Metastability-Containing Circuits. Transactions on Computers, 67,
- [11] Stephan Friedrichs and Attila Kinali. Efficient Metastability-Containing Multiplexers. In ISVLSI, pages 332-337, 2017.
- [12] Matthias Függer, Attila Kinali, Christoph Lenzen, and Thomas Polzer. Metastability-aware Memory-efficient Time-to-Digital Converters. In ASYNC, 2017.

- [13] Matthias Függer, Attila Kinali, Christoph Lenzen, and Ben Wiederhake. Fast All-Digital Clock Frequency Adaptation Circuit for Voltage Droop Tolerance. In ASYNC, pages 68-77, 2018.
- [14] R. Ginosar. Metastability and Synchronizers: A Tutorial. Design Test of Computers, 28(5):23-35, 2011.
- [15] Florian Huemer, Attila Kinali, and Christoph Lenzen. Faulttolerant Clock Synchronization with High Precision. In ISVLSI,
- [16] D. A. Huffman. The Design and Use of Hazard-Free Switching Networks. JACM, 4(1):47-62, 1957.
- [17] C. Ikenmeyer, B. Komarath, C. Lenzen, V. Lysikov, A. Mokhov, and K. Sreenivasaiah. On the complexity of hazard-free circuits.
- In STOC, pages 878–889, 2018.
  [18] P. Khanchandani and C. Lenzen. Self-Stabilizing Byzantine Clock Synchronization with Optimal Precision. Theory of Computing Systems, 2018.
- [19] David J. Kinniment. Synchronization and Arbitration in Digital Systems. Wiley Publishing, 2008.
- Stephen Cole Kleene. Introduction to Metamathematics. North Holland, 1952.
- [21] Donald E. Knuth. The Art of Computer Programming Vol. 3: Sorting and Searching, 1998.
- [22] P. M. Kogge and H. S. Stone. A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations. Transactions on Computers, C-22(8):786-793, 1973.
- [23] Richard E Ladner and Michael J Fischer. Parallel Prefix Computation. JACM, 27(4):831-838, 1980.
- Christoph Lenzen and Moti Medina. Efficient metastabilitycontaining gray code 2-sort. In ASYNC, pages 49-56, 2016.
- [25] Leonard Marino. General Theory of Metastable Operation. Transactions on Computers, C-30(2):107-115, 1981.
- [26] J. Sklansky. Conditional-Sum Addition Logic. IRE Transactions on Electronic Computers, EC-9(2):226-231, 1960.
- Earl E. Swartzlander and Carl E. Lemonds. Computer Arithmetic, volume 1. World Scientific, 2015.
- [28] G. Tarawneh and A. Yakovlev. An RTL Method for Hiding Clock Domain Crossing Latency. In ICECS, pages 540-543, 2012.
- [29] Ghaith Tarawneh, Matthias Függer, and Christoph Lenzen. Metastability Tolerant Computing. In ASYNC, pages 25–32, 2017.
- Jennifer Lundelius Welch and Nancy A. Lynch. A New Fault-Tolerant Algorithm for Clock Synchronization. Information and Computation, 77(1), 1988.
- [31] Reto Zimmermann. Binary Adder Architectures for Cell-Based VLSI and their Synthesis. PhD thesis, ETH Zurich, 1997.