### Appendix A: Number of elements

### A.1 General unitary

The path-encoded unitary \(U_{P}\) as well as the OAM sorter are implemented as networks of many interferometers. Beam splitters thus play an important role in their construction. In the following, we estimate the complexity of the OAM implementation \(U_{O}\) by counting the beam splitters used in its construction. An analogous count can be made for other optical elements. Let \(N_{O}(d)\) be the number of beam splitters required in the scheme of Eq. (3). If \(N_{P}(d)\) is the number of beam splitters that are necessary to implement \(U_{P}\), then

$$\begin{aligned} N_{O}(d) = N_{P}(d) + 2 N_{S }(d), \end{aligned}$$

(21)

where

$$ N_{S }(d) = 2 \, (d-1) $$

(22)

is the number of beam splitters that implement the OAM sorter in dimension *d* [25]. An upper bound for \(N_{P}(d)\) is provided by the scheme of Reck *et al.*, for which \(N_{P}(d) = d(d-1)/2\).
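As a quick numerical illustration of Eqs. (21) and (22), the following minimal Python sketch tabulates the counts for a few dimensions. The function names are hypothetical, and \(N_{P}(d)\) is taken at its Reck *et al.* upper bound \(d(d-1)/2\), so the resulting \(N_{O}(d)\) is likewise an upper bound rather than an exact count.

```python
def n_sorter(d: int) -> int:
    """Beam splitters in one OAM sorter of dimension d, Eq. (22)."""
    return 2 * (d - 1)


def n_path_reck(d: int) -> int:
    """Upper bound on N_P(d) taken from the Reck et al. scheme: d(d-1)/2."""
    return d * (d - 1) // 2


def n_oam(d: int) -> int:
    """Eq. (21): N_O(d) = N_P(d) + 2 N_S(d), with N_P at its Reck bound."""
    return n_path_reck(d) + 2 * n_sorter(d)


if __name__ == "__main__":
    for d in (2, 4, 8, 16):
        print(f"d={d:2d}  N_P={n_path_reck(d):3d}  "
              f"N_S={n_sorter(d):2d}  N_O={n_oam(d):3d}")
```

For small dimensions the two sorters dominate the count, while for large *d* the quadratic term of the path-encoded unitary takes over.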

### A.2 \({X}^{k}\) gates

The number of beam splitters required in a scheme for the *d*-dimensional *X* gate, where \(d = 2^{M}\), is equal to [26]

$$ N_{{X}}(d) = 4 \log _{2}(d). $$

(23)

Following the analysis in the main text, it is not hard to see that the number of beam splitters necessary to implement \({X}^{k}\), where \(d = 2^{M}\) and \(k = 2^{m}\), is

$$\begin{aligned} N_{{X}}\bigl(d, k = 2^{m}\bigr) = & k \, N_{{X}}(d/k) + 2 \, N_{S }(k) \end{aligned}$$

(24)

$$\begin{aligned} = & 4 \, \biggl(k \, \log _{2} \biggl(\frac{d}{k} \biggr) + k - 1 \biggr). \end{aligned}$$

(25)

It is easy to observe that for \(k = d/2 = 2^{M-1}\) there are no exchangers to be removed from the naive scheme, see Fig. 1(g), since the path permutation affects all the outermost exchangers in both OAM sorters. Indeed, in such a case \(N_{{X}}(d, k = d/2) = 2 \, N_{S }(d)\). On the other hand, for \(k = 1 = 2^{0}\) the number of required beam splitters is minimal and Eq. (25) coincides with \(N_{{X}}(d)\) of Eq. (23).

The structure of the network for powers *k* that are not powers of two is more complicated. Even though the same procedure as delineated in the main text can be followed, we refrain from describing it in detail and present the resulting number of beam splitters retained in the simplified setup. It turns out that this number for \({X}^{k}\) gates in dimension \(d = 2^{M}\) and for general powers \(1 \le k \le d/2\) is equal to

$$ N_{{X}}(d,k) = 4 \, \biggl(k \, \log _{2} \biggl( \frac{d}{2^{m+1}} \biggr) + 2^{m+1} - 1 \biggr), $$

(26)

where *m* is an integer such that \(2^{m} \le k < 2^{m+1}\). For \(k = 2^{m}\) we recover formula (25).
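The consistency of Eqs. (23) to (26) can be checked numerically. The sketch below is only an illustration with hypothetical function names; it evaluates Eq. (26) with integer arithmetic and compares it with Eqs. (24) and (25) whenever *k* is a power of two.

```python
def _log2_exact(x: int) -> int:
    """Return M such that x == 2**M; raise if x is not a power of two."""
    if x < 1 or x & (x - 1):
        raise ValueError(f"{x} is not a power of two")
    return x.bit_length() - 1


def n_x_pow2(d: int, k: int) -> int:
    """Eq. (24): k copies of the X gate in dimension d/k plus two sorters of size k."""
    n_x_small = 4 * (_log2_exact(d) - _log2_exact(k))  # Eq. (23) evaluated at d/k
    n_sorter_k = 2 * (k - 1)                           # Eq. (22) evaluated at k
    return k * n_x_small + 2 * n_sorter_k


def n_x_general(d: int, k: int = 1) -> int:
    """Eq. (26) for d = 2**M and any integer power 1 <= k <= d/2."""
    M = _log2_exact(d)
    if not 1 <= k <= d // 2:
        raise ValueError("Eq. (26) is stated for 1 <= k <= d/2")
    m = k.bit_length() - 1  # integer m with 2**m <= k < 2**(m+1)
    return 4 * (k * (M - m - 1) + 2 ** (m + 1) - 1)


if __name__ == "__main__":
    d = 16
    for k in range(1, d // 2 + 1):
        print(f"k={k}  N_X(d={d}, k)={n_x_general(d, k)}")
```

In this sketch, \(k = 1\) reproduces \(4 \log_{2}(d)\) and \(k = d/2\) reproduces \(2 N_{S}(d) = 4(d-1)\), in line with the limiting cases discussed above.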

From the structure of the OAM exchanger, Fig. 1(a), and the fact that the resulting setup for any \({X}^{k}\) gate consists only of OAM exchangers, it is clear that the same formula (26) also gives the number of Dove prisms and the number of holograms employed. The number of mirrors is twice as large, and no phase shifters are needed.

### A.3 Parallelized scheme

Let us denote by \(N_{O}^{(\mathrm{par})}\) the number of beam splitters employed in the parallelized scheme of Eq. (14). Analogously to Eq. (21) we obtain

$$\begin{aligned} N_{O}^{(\mathrm{par})}(n,d) = N_{P}(d) + 2 \, N_{\mathrm{SWAP}}(n,d), \end{aligned}$$

(27)

where \(N_{\mathrm{SWAP}}(n,d)\) is the number of beam splitters used to implement the swap operator with *n* input and *d* output paths. Due to the structure of the swap operator [25] we have to discuss the case with \(n \leq d\) and that with \(n > d\) separately. When both *n* and *d* are powers of two and \(n \leq d\), the number of beam splitters that implement the swap operator is given by [12]

$$ N_{\mathrm{SWAP}}(n,d) = \frac{n}{2} \log _{2}(n) + d \log _{2}(n) - 3 n + 2 d + 1. $$

(28)

In the opposite case with \(n \geq d\) one obtains

$$ N_{\mathrm{SWAP}}(n,d) = \frac{n}{2} \log _{2}(n) + d \log _{2}(d) + n - 2 d + 1. $$

(29)

When *n* or *d* is not a power of two, we construct the swap with \(2^{r}\) input and \(2^{s}\) output paths, where *r* and *s* are such that \(2^{r-1} < n \leq 2^{r}\) and \(2^{s-1} < d \leq 2^{s}\). Formulas (28) and (29) then represent upper bounds on the number of utilized beam splitters.
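The case distinction of Eqs. (28) and (29), together with the rounding to the next powers of two, can be sketched in a few lines of Python. The function names are hypothetical; for inputs that are not powers of two the returned value is only the upper bound described above, and in `n_oam_par` the term \(N_{P}(d)\) is assumed to take its Reck *et al.* value \(d(d-1)/2\).

```python
def _ceil_log2(x: int) -> int:
    """Smallest integer r with x <= 2**r (so that 2**(r-1) < x <= 2**r)."""
    if x < 1:
        raise ValueError("number of paths must be positive")
    return (x - 1).bit_length()


def n_swap(n: int, d: int) -> int:
    """Beam splitters in the swap with n input and d output paths.

    Eq. (28) is used for n <= d and Eq. (29) otherwise, after rounding
    n and d up to powers of two.
    """
    r, s = _ceil_log2(n), _ceil_log2(d)
    n2, d2 = 1 << r, 1 << s
    if n2 <= d2:
        return (n2 * r) // 2 + d2 * r - 3 * n2 + 2 * d2 + 1   # Eq. (28)
    return (n2 * r) // 2 + d2 * s + n2 - 2 * d2 + 1           # Eq. (29)


def n_oam_par(n: int, d: int) -> int:
    """Eq. (27), assuming the Reck et al. bound d(d-1)/2 for N_P(d)."""
    return d * (d - 1) // 2 + 2 * n_swap(n, d)


if __name__ == "__main__":
    for n, d in [(1, 8), (4, 4), (8, 2), (5, 3)]:
        print(f"n={n}, d={d}: N_SWAP={n_swap(n, d)}")
```

Two sanity checks follow directly from the formulas: Eqs. (28) and (29) give the same value at \(n = d\), and for \(n = 1\) Eq. (28) reduces to \(2(d-1)\), the value of Eq. (22).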

To compare the performance of the parallelized scheme with the naive approach, let us use the ratio *r* defined in Eq. (15) and consider a general unitary implemented with the scheme of Reck *et al.* [17]. For this scheme the ratio (15) scales roughly as

$$ r_{\mathrm{Reck}}(n, d) \sim \frac{1}{n} + \frac{2 \log _{2}(n)}{d^{2}}. $$

(30)

Even though the number of beam splitters in the swaps differs for the case of \(n \leq d\) and that of \(n > d\), the scaling (30) holds approximately for both of them. A sample of exact values of the ratio \(r_{\mathrm{Reck}}\) as well as its asymptotic behavior are depicted in Fig. 6. Permutations serve as a counterpart of this general scheme as far as the number of beam splitters is concerned, see \(r_{\mathrm{perm.}}\) in Eq. (17). For comparison, a sample of exact values of \(r_{\mathrm{perm.}}\) as well as the asymptotic behavior (17) are also depicted in Fig. 6.

As for the other optical elements involved in the parallelized setup built using the scheme of Reck *et al.*, the following estimates apply. The path-only unitary implementation requires \(O(d^{2})\) beam splitters and \(O(d^{2})\) phase shifters. Building the swap operators requires \(O(n \log _{2}(n))\) beam splitters, \(O(n)\) phase shifters, \(O(n \log _{2}(n))\) holograms, and \(O(n)\) Dove prisms, provided that \(n > d\) [12]. When \(n \leq d\), the estimates depend only on *d*, not on *n*.

### A.4 Parallelized \({X}^{k}\) gates

The resulting number of beam splitters required to implement the parallelized version of the *X* gate in dimension \(d = 2^{M}\) for \(n = 2^{K}\) paths is equal to

$$ N_{{X}}^{(\mathrm{par})}(n, d) = n \log _{2}(n) + 2 \, n - 2, $$

(31)

provided that \(n \geq d\). This formula does not depend on the dimension *d*, only on the number of paths. The naive approach of stacking *n* non-parallelized schemes of Eq. (23) would require \(N_{{X}}(d) \, n = 4 n \log _{2}(d)\) beam splitters. The saving in resources is thus quantified by the ratio

$$ r_{{X}}(n, d) \approx \frac{\log _{2}(n)}{4 \log _{2}(d)} $$

(32)

for large enough dimensions *d* and number of paths \(n \geq d\). When the number of paths is approximately equal to the dimension, the ratio above approaches a constant factor of \(1/4\) and the parallelization of Eq. (14) provides a moderate improvement over the naive approach. When \(n \leq d\), the formula (31) is modified, but even then the improvement resulting from the parallelized scheme is rather moderate.

By calculations analogous to those for the non-parallelized \({X}^{k}\) gates, the number of beam splitters retained in the final implementation of the parallelized scheme for an arbitrary \(k \leq d/2\) turns out to be

$$ N_{{X}}^{(\mathrm{par})}(n, d, k) = n \log _{2}(n) + 2 n - 4 k + 2 + 2 d \biggl( \frac{k}{2^{m}} + m - 1 \biggr), $$

(33)

where *m* is an integer such that \(2^{m} \le k < 2^{m+1}\) and where we assume \(n \geq d\). For \(k = 1\) this expression reduces to formula (31) derived for the parallelized version of the *X* gate. An analogous count, with similar results, can be made for the other optical elements and for \(n < d\).

When we compare the scaling for the naive approach utilizing *n* identical copies of the \({X}^{k}\) gate and the parallelization of Eq. (14), we obtain a scaling ratio that approaches

$$ r_{{X}}(n, d, k) \sim \frac{1}{4 \, k} \frac{\log _{2}(n)}{\log _{2}(d)}, $$

(34)

where again \(n \geq d\). For high powers *k* we, therefore, save more resources by making use of the parallelized version. In this formula, we assumed *k* to be constant. We can, however, also consider *k* that scales with the dimension *d*. For instance, the most resource-demanding scenario is when \(k = d/2\). In such a case one obtains

$$ r_{{X}}(n, d, d/2) \lesssim \frac{3\log _{2}(n)}{4d}. $$

(35)

Unless the number of paths exceeds the dimension exponentially, the parallelized scheme of Eq. (14) offers in this scenario substantial savings in resources when compared to the naive approach. For \(n < d\) one can perform an analogous analysis.
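The counts of this subsection can be compared numerically with the asymptotic estimates. In the sketch below the function names are hypothetical, and the ratio is assumed to be the parallelized count of Eq. (33) divided by *n* times the single-copy count of Eq. (26), which is our reading of the comparison with the naive approach. The estimates (34) and (35) are asymptotic, so the exact values need not agree closely with them for small *n* and *d*.

```python
import math


def _log2_exact(x: int) -> int:
    """Return M such that x == 2**M; raise if x is not a power of two."""
    if x < 1 or x & (x - 1):
        raise ValueError(f"{x} is not a power of two")
    return x.bit_length() - 1


def n_x_general(d: int, k: int = 1) -> int:
    """Eq. (26): beam splitters of a single X^k gate in dimension d = 2**M."""
    M = _log2_exact(d)
    if not 1 <= k <= d // 2:
        raise ValueError("Eq. (26) is stated for 1 <= k <= d/2")
    m = k.bit_length() - 1  # 2**m <= k < 2**(m+1)
    return 4 * (k * (M - m - 1) + 2 ** (m + 1) - 1)


def n_x_par(n: int, d: int, k: int = 1) -> int:
    """Eq. (33): beam splitters of the parallelized X^k gate, assuming n >= d."""
    K = _log2_exact(n)
    _log2_exact(d)  # validates that d is a power of two
    if n < d:
        raise ValueError("Eq. (33) assumes n >= d")
    if not 1 <= k <= d // 2:
        raise ValueError("Eq. (33) is stated for 1 <= k <= d/2")
    m = k.bit_length() - 1
    # 2 d (k / 2**m + m - 1), kept in integers: 2**m divides 2 d because k <= d/2
    last_term = (2 * d * k) // 2 ** m + 2 * d * (m - 1)
    return n * K + 2 * n - 4 * k + 2 + last_term


def ratio(n: int, d: int, k: int = 1) -> float:
    """Assumed ratio: parallelized count over n stacked single-copy gates."""
    return n_x_par(n, d, k) / (n * n_x_general(d, k))


if __name__ == "__main__":
    n = d = 64
    for k in (1, 2, 4):
        estimate = math.log2(n) / (4 * k * math.log2(d))        # Eq. (34)
        print(f"k={k}: exact={ratio(n, d, k):.4f}  estimate={estimate:.4f}")
    bound = 3 * math.log2(n) / (4 * d)                          # Eq. (35)
    print(f"k=d/2: exact={ratio(n, d, d // 2):.4f}  bound={bound:.4f}")
```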

### Appendix B: Losses

We can make rough estimates of the losses of the setups presented in the main text by adopting the following simplifications. There are many sources of errors, such as possible beam distortions due to Dove prisms, non-unit conversion efficiency of holograms, different splitting ratios of beam splitters, as well as imperfect reflection of mirrors. Let us assume that all these errors can be modelled as losses quantified by an effective mean transmittance *T* of each optical element, where we omit the phase shifters as these can be implemented by a mere path length difference in an interferometer. We also assume that all OAM modes are affected in the same way and that \(n = d\) in the parallelized scheme.

The number of elements that a photon has to traverse in the universal scheme of Eq. (3) equals approximately \(L(d) = d + 10 \log _{2}(d)\). The transmittance of this scheme thus equals \(T^{L(d)}\) and is, therefore, of the same order of magnitude as the transmittance of the scheme of Ref. [18] for purely path-encoded unitaries, in which each photon traverses *d* elements.

The naive implementation of the parallel transformation \(U_{O}^{\mathrm{(par)}}\) with \(n = d\) consists of *d* copies of the universal scheme. When a photon is launched into each of them, the probability that all photons make it through the setup and no photon is lost equals \(T^{d \, L(d)} = T^{d^{2} + 10 d \log _{2}(d)}\). The parallelized scheme of Eq. (14) has a more complicated structure, where a photon launched into the *j*-th port propagates through a different number \(L_{j}(d)\) of elements. The probability that no photon is lost is given by \(T^{L_{0}(d)} T^{L_{1}(d)} \ldots T^{L_{d-1}(d)} = T^{\sum _{j} L_{j}(d)}\), where the exponent reads \(\sum_{j=0}^{d-1} L_{j}(d) = d^{2} + 12 d \log _{2}(d)\). The simultaneous transmission of *d* photons through the parallelized scheme is thus quantified by a transmittance of \(T^{d^{2} + 12 d \log _{2}(d)}\), which differs by a factor of \(T^{2 d \log _{2}(d)}\) from the naive scheme. The per-photon transmittance is thus decreased by a factor of \(T^{2 \log _{2}(d)}\) for the parallelized scheme. If we assume that the effective mean transmittance of each element is \(T = 0.9\), this factor does not drop below 0.43 for dimensions up to \(d = 16\), where it equals \(0.9^{8} \approx 0.43\).
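The loss estimates above are simple enough to evaluate directly. The following Python sketch, with hypothetical function names, works with the exponents of *T* rather than with the transmittances themselves, since the latter become extremely small already for moderate *d*; it relies only on the approximate element counts quoted in this appendix.

```python
import math


def elements_universal(d: int) -> float:
    """Approximate number of elements traversed in the universal scheme, L(d)."""
    return d + 10 * math.log2(d)


def exponent_naive(d: int) -> float:
    """Exponent of T when d photons pass d copies of the universal scheme."""
    return d * elements_universal(d)


def exponent_parallel(d: int) -> float:
    """Exponent of T for d photons in the parallelized scheme with n = d."""
    return d ** 2 + 12 * d * math.log2(d)


def per_photon_penalty(T: float, d: int) -> float:
    """Per-photon factor T**(2 log2 d) lost by parallelizing."""
    return T ** (2 * math.log2(d))


if __name__ == "__main__":
    T = 0.9
    for d in (2, 4, 8, 16):
        extra = exponent_parallel(d) - exponent_naive(d)
        print(f"d={d:2d}  extra exponent={extra:5.0f}  "
              f"per-photon factor={per_photon_penalty(T, d):.4f}")
```

The per-photon factor decreases monotonically with *d*, so its value at \(d = 16\) is the smallest one in the quoted range.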

Let us note that all the schemes for implementing unitaries in OAM presented above share with the purely path-encoded schemes [17–19] the fact that the overall transmittance drops exponentially with the dimension *d*, which is of particular concern in real-world experimental implementations.

### Appendix C: Periodicity in OAM

The OAM subspace is usually defined as a linear span of eigenstates \(\vert 0 \rangle _{O}\), \(\vert 1 \rangle _{O}\), \(\vert 2 \rangle _{O}\), …, \(\vert d-1 \rangle _{O}\) for a fixed dimension *d*. This subspace can be used in the universal scheme of Eq. (3). When we want to make use of the periodicity in OAM, the dimension has to be of the form \(d = 2^{M}\) so that the interferometric implementations of OAM sorters and swap operators work properly for higher-order eigenstates [12]. For the power-of-two dimensions, the OAM sorter has been shown [25] to act as

$$ S_{d}\bigl( \vert m \rangle _{O} \vert 0 \rangle _{P}\bigr) = \biggl\vert d \cdot \biggl\lfloor \frac{m}{d} \biggr\rfloor \biggr\rangle _{O} \vert m \operatorname {mod}d \rangle _{P}. $$

(36)

This ‘modulo property’ makes sure that OAM eigenstates of the form \(\vert 0 + a d \rangle _{O}\), \(\vert 1 + a d \rangle _{O}\), \(\vert 2 + a d \rangle _{O}\), …, \(\vert d-1 + a d \rangle _{O}\) for some \(a \in \mathbb{Z}\) are not mixed with eigenstates from other OAM subspaces. When an eigenstate \(\vert m + a d \rangle _{O}\) with \(0 \le m < d\) enters the OAM sorter, it gets transformed into \(\vert a \, d \rangle _{O}\) propagating along the *m*-th output path. All the aforementioned eigenstates thus leave the OAM sorter along *d* different paths, but all of them are at that moment equal to \(\vert a \, d \rangle _{O}\). In this way, the path-only implementation \(U_{P}\) mixes only the terms that correspond to the same subspace, which results in the parallelized operation in the OAM degree of freedom.
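The modulo property of Eq. (36) is just an integer division with remainder on the basis labels. A minimal Python sketch (the function name is hypothetical, and only the action on basis labels is modelled, not the optics):

```python
def oam_sorter(m: int, d: int) -> tuple:
    """Action of Eq. (36) on the labels of |m>_O |0>_P.

    Returns (OAM label, path label) = (d * floor(m / d), m mod d).
    Python's divmod floors towards minus infinity, so negative OAM
    values are handled with a non-negative path label as well.
    """
    quotient, remainder = divmod(m, d)
    return d * quotient, remainder


if __name__ == "__main__":
    d = 4
    for a in (-1, 0, 2):
        outputs = [oam_sorter(j + a * d, d) for j in range(d)]
        print(f"a={a:2d}: {outputs}")
```

For each *a*, all *d* inputs end up with the same OAM label \(a \, d\) and occupy *d* distinct paths, which is the statement made in the paragraph above.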

From the implementation of the swap operator it follows that whenever the number *n* of input paths exceeds the number *d* of output paths, only a specific class of incoming OAM eigenstates gets swapped correctly. Specifically, only eigenstates of the form \(\vert 0 \rangle _{O}, \vert n/d \rangle _{O}, \vert 2 n/d \rangle _{O}, \ldots , \vert k \, n/d \rangle _{O}\) for \(k \in \mathbb{Z}\) can then be used in our parallelized scheme. For these eigenstates, the action of the interferometric implementation of the swap can be summarized as [12]

$$ \mathrm{SWAP}_{n, d} \biggl( \biggl\vert \frac{n}{d} \cdot m \biggr\rangle _{O} \vert p \rangle _{P} \biggr) = \biggl\vert n \cdot \biggl\lfloor \frac{m}{d} \biggr\rfloor + p \biggr\rangle _{O} \vert m \operatorname {mod}d\rangle _{P}. $$

(37)

Since we work only with *n* and *d* that are powers of two, their ratio \(n/d\) is an integer whenever \(n \geq d\). The above formula is a generalization of the ‘modulo property’ for the swap operator. If \(n \leq d\), the action of the swap operator can be written as

$$ \mathrm{SWAP}_{n, d} \bigl( \vert m \rangle _{O} \vert p \rangle _{P} \bigr) = \biggl\vert \frac{d}{n} \cdot \biggl( n \cdot \biggl\lfloor \frac{m}{d} \biggr\rfloor + p \biggr) \biggr\rangle _{O} \vert m \operatorname {mod}d \rangle _{P}. $$

(38)
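The label bookkeeping of Eqs. (37) and (38) can be illustrated in the same way. The sketch below uses a hypothetical function name and models only the mapping of basis labels. It assumes that *n* and *d* are powers of two and that the path label satisfies \(0 \le p < n\). For \(n > d\), inputs whose OAM label is not a multiple of \(n/d\) are rejected, since Eq. (37) does not cover them.

```python
def _is_pow2(x: int) -> bool:
    return x >= 1 and x & (x - 1) == 0


def swap_action(oam: int, path: int, n: int, d: int) -> tuple:
    """Map input labels (OAM, path) to output labels (OAM, path).

    Eq. (37) is used for n >= d and Eq. (38) for n < d; the two
    expressions coincide at n == d.
    """
    if not (_is_pow2(n) and _is_pow2(d)):
        raise ValueError("n and d are assumed to be powers of two")
    if not 0 <= path < n:
        raise ValueError("input path label must satisfy 0 <= p < n")
    if n >= d:
        step = n // d
        if oam % step:
            raise ValueError("OAM label is not a multiple of n/d")
        m = oam // step                                   # input is |(n/d) m>_O
        return n * (m // d) + path, m % d                 # Eq. (37)
    return (d // n) * (n * (oam // d) + path), oam % d    # Eq. (38)


if __name__ == "__main__":
    print(swap_action(2, 3, n=4, d=2))   # Eq. (37) with m = 1, p = 3
    print(swap_action(5, 1, n=2, d=4))   # Eq. (38) with m = 5, p = 1
```

On the admissible inputs the mapping is one-to-one: the output OAM label determines \(\lfloor m/d \rfloor\) and *p*, and the output path gives \(m \operatorname{mod} d\).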