Reinforcement learning assisted recursive QAOA
EPJ Quantum Technology volume 11, Article number: 6 (2024)
Abstract
In recent years, variational quantum algorithms such as the Quantum Approximate Optimization Algorithm (QAOA) have gained popularity as they provide the hope of using NISQ devices to tackle hard combinatorial optimization problems. It is, however, known that at low depth, certain locality constraints of QAOA limit its performance. To go beyond these limitations, a nonlocal variant of QAOA, namely recursive QAOA (RQAOA), was proposed to improve the quality of approximate solutions. RQAOA has been studied comparatively less than QAOA, and it is less understood, for instance, for which families of instances it may fail to provide high-quality solutions. However, as we are tackling NP-hard problems (specifically, the Ising spin model), it is expected that RQAOA does fail, raising the question of designing even better quantum algorithms for combinatorial optimization. In this spirit, we identify and analyze cases where (depth-1) RQAOA fails and, based on this, propose a reinforcement learning enhanced RQAOA variant (RL-RQAOA) that improves upon RQAOA. We show that the performance of RL-RQAOA improves over RQAOA: RL-RQAOA is strictly better on the identified instances where RQAOA underperforms and performs comparably on instances where RQAOA is near-optimal. Our work exemplifies the potentially beneficial synergy between reinforcement learning and quantum (inspired) optimization in the design of new, even better heuristics for complex problems.
1 Introduction
As quantum computing is becoming practical [1–4], there has been a growing interest in employing near-term quantum algorithms to help solve problems in quantum chemistry [5], quantum machine learning [6], and combinatorial optimization [7]. Any such near-term algorithm must consider the primary restrictions of Noisy Intermediate-Scale Quantum (NISQ) devices; e.g., the number of qubits, decoherence, etc. Variational Quantum Algorithms (VQAs) such as the Quantum Approximate Optimization Algorithm (QAOA) [7] were developed as a potential approach to achieve a quantum advantage in practical applications while keeping these design restrictions in mind.
For a user-specified input depth l, QAOA consists of a quantum circuit with 2l variational parameters. In the limit of infinite depth, for optimal parameters, the solution of QAOA converges to the optimum for a given combinatorial optimization problem [7]. However, a significant body of research has produced negative results [8–15] for QAOA limited to logarithmic depth (in the number of qubits), exploiting the notion of locality or symmetry in QAOA. This motivates the study of techniques that circumvent the restriction of locality or symmetry in QAOA, which exploit the information-processing capabilities of low-depth quantum circuits by employing classical nonlocal pre- and post-processing steps.^{Footnote 1}
One such proposal is the recursive QAOA (RQAOA), a nonlocal variant of QAOA, which uses shallow-depth circuits of QAOA iteratively, and at every iteration, the size of the problem (usually expressed in terms of a graph or a hypergraph) is reduced by one (or more). The elimination procedure introduces nonlocal effects via the new connections between previously unconnected nodes, which counteracts the locality restrictions of QAOA. The authors in [11, 16, 17] empirically show that depth-1 RQAOA always performs better than depth-1 QAOA and is competitive with the best-known classical algorithms based on rounding of a semidefinite programming relaxation for Ising and graph colouring problems. However, given that these problems are NP-hard, there must also exist instances that RQAOA fails to solve exactly, unless \(\mathsf{NP} \subseteq \mathsf{BQP}\). Hence, to further push the boundaries of algorithms for combinatorial optimization on NISQ devices (and beyond), it is helpful to determine when RQAOA fails, as this can aid in developing better variants of RQAOA.
In this work, we study extensions of RQAOA which perform better than RQAOA for the Ising problem (or equivalently, the weighted MaxCut problem, where the external field is zero; refer to Sect. 2.2). We do this by identifying cases where RQAOA fails (i.e., finding small-scale instances with approximation ratio ≤0.95). Then, we analyze the reasons for this failure and, based on these insights, we modify RQAOA. We employ reinforcement learning (RL) to not only tweak RQAOA’s selection rule, but also train the parameters of QAOA instead of using energy-optimal ones, in a new algorithm that we call RL-RQAOA. In particular, the proposed hybrid algorithm provides a suitable testbed for assessing the potential benefit of RL: we perform simulations of (depth-1) RQAOA and RL-RQAOA on an ensemble of randomly generated weighted d-regular graphs and show that RL-RQAOA consistently outperforms its counterparts. In the proposed algorithm, the RL component itself plays an integral role in finding the solution, so this raises the question of the actual role of the QAOA circuit and thus potential quantum advantages. To show that the QAOA circuits have a nontrivial contribution to the advantage, we compare RL-RQAOA to an entirely classical RL agent (which, given exponential time, imitates a brute-force algorithm) and show that RL-RQAOA converges both faster and to better solutions than the simple classical RL agents. We note that our approach to enhancing RQAOA’s performance is not limited to depth-1 and can be straightforwardly extended to higher depths.
We present our results as follows: Sect. 2 introduces QAOA, recursive QAOA (RQAOA), and fundamental concepts behind policy gradient methods in RL. Section 3 presents related works. Section 4 describes the limitations of RQAOA, and we illustrate their validity by performing numerical simulations. In Sect. 5, we provide a sketch of the policies of RL-RQAOA (quantum-classical) and RL-RONE (classical, introduced to characterize the role of quantum aspects of the algorithm) and their learning algorithms. Section 6 presents our computational results for the comparison between classical and hybrid algorithms (RQAOA, RL-RQAOA, and RL-RONE) on an ensemble of Ising instances. Finally, we conclude with a discussion in Sect. 7.
2 Background
In this section, we first provide a brief overview of QAOA (Sect. 2.1) and its classical simulatability for the Ising problem (Sect. 2.2). We then introduce recursive QAOA (RQAOA) (Sect. 2.3), upon which we base our proposal for RL-enhanced RQAOA, and introductory concepts behind the policy gradient method in RL (Sect. 2.4). These notions will give us the tools to develop policies based on the QAOA ansatz and their learning algorithms in the upcoming sections.
2.1 Quantum approximate optimization algorithm
QAOA seeks to approximate the maximum of the binary cost function \(\mathcal{C}: \{0,1\}^{n} \rightarrow \mathbb{R}\) encoded into a Hamiltonian as \(H_{n} = \sum_{x \in \{0,1\}^{n}} \mathcal{C}(x) { \vert {x} \rangle }{ \langle{x} \vert }\). Starting from an initial state \({ \vert {s} \rangle } = { \vert {+^{n}} \rangle }\) (uniform superposition state), QAOA alternates between two unitary evolution operators \(U_{p}(\gamma ) = \exp (i \gamma H_{n})\) (phase operator) and \(U_{m}(\alpha ) = \exp (i \alpha H_{b})\) (mixer operator), where \(H_{b} = \sum_{j=1}^{n} X_{j}\). Hereafter, X, Y, Z are standard Pauli operators and \(P_{j}\) is a Pauli operator acting on qubit j for \(P \in \{X, Y, Z\}\). The phase and mixer operators are applied alternately a total of l times each, generating the quantum state

\[ { \vert {\Psi _{l}(\vec{\alpha }, \vec{\gamma })} \rangle } = U_{m}(\alpha _{l}) U_{p}(\gamma _{l}) \cdots U_{m}(\alpha _{1}) U_{p}(\gamma _{1}) { \vert {s} \rangle }, \qquad (1) \]
where the variational parameters \(\{\vec{\alpha}, \vec{\gamma}\} \in [0,2 \pi ]^{2l}\) and the integer l is called the QAOA depth. The depth l controls the nonlocality of the QAOA circuit. During the operation of QAOA, these parameters are tuned to optimize the expected value \({ \langle H_{n} \rangle } := { \langle{\Psi _{l}(\vec{\alpha }, \vec{\gamma })} \vert } H_{n} { \vert {\Psi _{l}(\vec{\alpha }, \vec{\gamma })} \rangle }\). The preparation of the state (1) is followed by a measurement in the computational basis, which outputs a bitstring x corresponding to a candidate solution of the cost function \(\mathcal{C}\). The probability \(\mathbb{P}_{l}(x)\) of obtaining a bitstring \(x \in \{0,1\}^{n}\) is given by Born’s rule,

\[ \mathbb{P}_{l}(x) = \bigl\vert { \langle{x} \vert {\Psi _{l}(\vec{\alpha }, \vec{\gamma })} \rangle } \bigr\vert ^{2}. \qquad (2) \]
A candidate bitstring \(x^{*}\) is called an r-approximation solution to a given instance, for \(0 \leq r \leq 1\), if

\[ \mathcal{C}\bigl(x^{*}\bigr) \geq r \cdot \max_{x \in \{0,1\}^{n}} \mathcal{C}(x). \qquad (3) \]
An algorithm is said to achieve an approximation ratio of r for a cost function \(\mathcal{C}\) if it returns an rapproximation or better for every problem instance in the class (i.e., in the worst case).
We say that depth-l QAOA achieves an approximation ratio of r for a problem instance of a cost function \(\mathcal{C}\) if there exist parameters \(\{\vec{\alpha}, \vec{\gamma}\}\) such that

\[ { \langle{\Psi _{l}(\vec{\alpha }, \vec{\gamma })} \vert } H_{n} { \vert {\Psi _{l}(\vec{\alpha }, \vec{\gamma })} \rangle } \geq r \cdot \max_{x \in \{0,1\}^{n}} \mathcal{C}(x). \qquad (4) \]
We note that repeating a sequence of state preparations and measurements approximates the distribution of x given by (2) and that (4) is the mean of this distribution. The candidate bitstring \(x^{*}\) may then be selected to yield the maximum approximation ratio r.
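To make the above concrete, the state preparation (1), Born-rule sampling (2), and energy expectation (4) can be reproduced for small n with a brute-force statevector simulation. The sketch below is illustrative only: the graph, weights, and angles are arbitrary choices, not taken from the paper.

```python
import numpy as np

def qaoa_depth1_state(n, edges, gamma, alpha):
    """Depth-1 QAOA state via brute-force statevector simulation (2^n amplitudes).
    `edges` maps (u, v) -> J_uv; cost C(x) = sum_{(u,v)} J_uv * z_u * z_v
    with z_j = (-1)^{x_j} (Ising problem with zero external field)."""
    dim = 2 ** n
    # Diagonal of the cost Hamiltonian H_n over all bitstrings.
    cost = np.zeros(dim)
    for idx in range(dim):
        z = [1 - 2 * ((idx >> j) & 1) for j in range(n)]
        cost[idx] = sum(w * z[u] * z[v] for (u, v), w in edges.items())
    # |s> = |+>^n, then the phase operator U_p(gamma) = exp(i*gamma*H_n).
    psi = np.full(dim, dim ** -0.5, dtype=complex) * np.exp(1j * gamma * cost)
    # Mixer U_m(alpha) = exp(i*alpha*X) on each qubit:
    # exp(i*a*X) = cos(a) I + i sin(a) X.
    for j in range(n):
        partner = np.arange(dim) ^ (1 << j)   # index with bit j flipped
        psi = np.cos(alpha) * psi + 1j * np.sin(alpha) * psi[partner]
    return psi, cost

# Triangle with unit weights and arbitrary angles.
edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0}
psi, cost = qaoa_depth1_state(3, edges, gamma=0.4, alpha=0.3)
probs = np.abs(psi) ** 2              # Born's rule, Eq. (2)
energy = float(probs @ cost)          # <Psi_1| H_3 |Psi_1>, as in Eq. (4)
x_star = np.random.default_rng(0).choice(2 ** 3, p=probs)  # sampled candidate
```

Since the mixer is applied qubit by qubit as a pairing of amplitudes, the whole circuit needs only \(O(n 2^{n})\) work, which is enough for the small instances discussed here.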
2.2 Classical simulatability of QAOA for the Ising problem
Next, we review the classical simulatability of a paradigmatic case of QAOA for the Ising problem. This is a core building block for simulating both (depth-1) RQAOA and RL-RQAOA, as it enables their efficient classical simulation at depth-1 for arbitrary graphs. Given a graph \(G_{n} = (V,E)\) with n vertices \(V = [n]\) (where \([n] = \{1, 2,\ldots , n\}\)) and edges \(E \subset V \times V\), as well as an external field \(h_{u} \in \mathbb{R}\) and a coupling coefficient (edge weight) \(J_{uv} \in \mathbb{R}\) associated with each vertex and edge respectively, the Ising problem aims to find a spin configuration \(s \in \{-1, +1\}^{n}\) maximizing the cost Hamiltonian,^{Footnote 2}

\[ H_{n} = \sum_{(u,v) \in E} J_{uv} Z_{u} Z_{v} + \sum_{u \in V} h_{u} Z_{u}. \qquad (5) \]
The Ising problem without any external field is equivalent to the weighted MaxCut problem, where the goal is to find a bipartition of vertices such that the total weight of the edges between them is maximized. The expected value of each Pauli operator \(Z_{u}\) and \(Z_{u}Z_{v}\) on depth-1 QAOA can be computed classically in \(O(n)\) time using the analytical results stated in Theorem 1 in Appendix A. Since the cost function has \(O(n^{2})\) terms in the worst case, computing the final expected value of (5) hence takes total time \(O(n^{3})\) given the variational parameters.
2.3 Recursive QAOA
In this subsection, we outline the RQAOA algorithm of Bravyi et al. [11] for the Ising problem as defined in (5) with no external fields (\(h_{u} = 0\), \(\forall u \in V\)). This will serve as a base for our proposal of RL-enhanced RQAOA. The RQAOA algorithm aims to approximate the maximum expected value^{Footnote 3}\(\max_{x} { \langle{x} \vert } H_{n} { \vert {x} \rangle }\), where \(x \in \{0, 1\}^{n}\). It consists of the following steps. First, a standard depth-l QAOA is executed to find the quantum state \({ \vert {\Psi ^{*}_{l}(\vec{\alpha }, \vec{\gamma })} \rangle }\) (with optimal variational parameters) as in (1) that maximizes the expectation value of \(H_{n}\). For each edge \((u, v) \in E\), the two-correlation \(M_{u, v} = { \langle{\Psi ^{*}_{l}(\vec{\alpha }, \vec{\gamma })} \vert } Z_{u}Z_{v} { \vert {\Psi ^{*}_{l}(\vec{\alpha }, \vec{\gamma })} \rangle }\) is computed. A variable \(Z_{u}\) with the largest magnitude \(\vert M_{u, v} \vert \) is then eliminated (breaking ties arbitrarily) by imposing the constraint

\[ Z_{u} = \operatorname{sgn}(M_{u, v}) \, Z_{v}, \qquad (6) \]
which yields a new Ising Hamiltonian \(H_{n-1}\) with at most \(n-1\) variables. The resulting Hamiltonian is processed iteratively, following the same steps. This iterative process stops once the number of variables is below a predefined threshold \(n_{c}\). The remaining Hamiltonian with \(n_{c}\) variables can then be solved using a classical algorithm (e.g., a brute-force method). The final solution is then obtained by iteratively reconstructing the eliminated variables using (6).
We note that the variable elimination scheme in RQAOA is analogous to rounding solutions obtained by solving continuous relaxations of combinatorial optimization problems. We refer the interested reader to [18, Sec. V.A.] for a detailed discussion on the connection between quantum optimization algorithms and classical approximation algorithms. Recall that the final expected value of \(H_{n}\) as in (5) can be computed in \(O(n^{3})\) time. Since we can choose \(n_{c} = O(1)\), RQAOA runs for approximately n iterations, so that the total running time is \(O(n^{4})\) (neglecting the time needed to find the optimal variational parameters).
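For illustration, one elimination round of the form (6) amounts to a weighted edge contraction. The sketch below is our own minimal encoding of the field-free Hamiltonian as an edge-weight map (not the paper's code): it substitutes \(Z_{u} = \operatorname{sgn}(M_{u,v}) Z_{v}\) into the remaining terms and merges the resulting parallel edges.

```python
def eliminate_variable(edges, u, v, sign):
    """One RQAOA rounding step: impose Z_u = sign * Z_v, as in Eq. (6), on
    H = sum_{(a,b)} J_ab Z_a Z_b. `edges` maps frozenset({a, b}) -> J_ab.
    Returns the reduced edge-weight map and the constant energy offset
    contributed by the contracted (u, v) term itself."""
    new_edges, offset = {}, 0.0
    for e, w in edges.items():
        if u in e:
            (other,) = e - {u}
            if other == v:                 # J_uv Z_u Z_v -> sign * J_uv (constant)
                offset += sign * w
                continue
            e, w = frozenset({v, other}), sign * w   # re-attach the edge to v
        new_edges[e] = new_edges.get(e, 0.0) + w     # merge parallel edges
    return new_edges, offset

# Triangle with J = +1 on every edge; eliminate Z_0 = +Z_1.
E = {frozenset({0, 1}): 1.0, frozenset({1, 2}): 1.0, frozenset({0, 2}): 1.0}
E_reduced, offset = eliminate_variable(E, u=0, v=1, sign=+1)
# E_reduced == {frozenset({1, 2}): 2.0}, offset == 1.0
```

Note that the re-attachment step is precisely what creates new connections between previously unconnected vertices, i.e., the nonlocal effects mentioned above.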
2.4 Reinforcement learning primer
As our proposal to improve upon RQAOA is based on reinforcement learning, we introduce basic concepts behind RL and the policy gradient method in this subsection.
In RL, the agent learns an optimal policy by interacting with its environment using a trial-and-error approach [19]. Formally, RL can be modeled as a Markov Decision Process (MDP) defined by the tuple \((\mathcal{S}, \mathcal{A}, p, R)\), where \(\mathcal{S}\) and \(\mathcal{A}\) represent the state and action spaces (both can be continuous or discrete), the function \(p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \rightarrow [0,1]\) defines the transition dynamics, and \(R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\) describes the reward function of the environment. An agent’s behaviour is governed by a stochastic policy \(\pi _{\theta}(a \mid s) : \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]\), for \(a \in \mathcal{A}\) and \(s \in \mathcal{S}\). Highly expressive function approximators, such as deep neural networks (DNN), can be used to parametrize a policy \(\pi _{\theta}\) using tunable parameters \(\theta \in \mathbb{R}^{d}\). An agent’s interaction governed by a policy \(\pi _{\theta}(a \mid s)\) in the environment can be viewed as sampling a trajectory \(\tau \sim \mathbb{P}_{E}(\cdot )\) from the MDP, where

\[ \mathbb{P}_{E}(\tau ) = p_{0}(s_{0}) \prod_{t=0}^{H-1} \pi _{\theta}(a_{t} \mid s_{t}) \, p(s_{t+1} \mid s_{t}, a_{t}) \]
is the probability that a trajectory τ of length H occurs, where \(p_{0}\) is the distribution of the initial state \(s_{0}\). An example of a trajectory is

\[ \tau = (s_{0}, a_{0}, r_{0}, s_{1}, a_{1}, r_{1}, \ldots , s_{H}). \]
An agent collects a sequence of rewards based on its interactions with the environment. The metric that assesses an agent’s performance is called the value function \(V_{\pi _{\theta}}\) and takes the form of a discounted sum as follows,

\[ V_{\pi _{\theta}}(s_{0}) = \mathbb{E}_{\tau \sim \mathbb{P}_{E}} \Biggl[ \sum_{t=0}^{H-1} \gamma ^{t} r_{t} \Biggr], \]
where \(s_{0}\) is an initial state of an agent’s trajectory τ within an environment, \(\mathbb{P}_{E}\) describes the environment dynamics (i.e., in the form of an MDP), and \(r_{t}\) is the reward at time step t during the interaction. Every trajectory has a horizon (length) \(H \in \mathbb{N} \cup \{\infty \}\), and the expected return involves a discount factor \(\gamma \in [0,1]\). Most often one chooses \(\gamma < 1\) to avoid unwanted diverging value functions for a horizon \(H = \infty \). Finally, the goal of an RL algorithm is to learn an optimal policy \(\pi ^{*}_{\theta}\) such that the value function is maximized for each state. One way of finding a good policy is the policy gradient method, i.e., finding an optimal set of parameters θ which maximize the value function of the policy (by evaluating its gradient). For the sake of brevity, we defer the explanation of the policy gradient method to Appendix B.
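As a minimal, self-contained illustration of the policy gradient idea deferred to Appendix B, the REINFORCE estimator \(\nabla _{\theta} V \approx \sum_{t} \nabla _{\theta} \log \pi _{\theta}(a_{t} \mid s_{t}) (r_{t} - b)\) can be run on a toy three-armed bandit. The environment and all hyperparameters here are our own illustrative choices, unrelated to the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(3)                       # policy logits over 3 actions
mean_reward = np.array([0.2, 1.0, 0.5])   # toy environment, unknown to the agent

def softmax(z):
    p = np.exp(z - z.max())               # numerically stable softmax
    return p / p.sum()

baseline = 0.0
for _ in range(5000):
    p = softmax(theta)
    a = rng.choice(3, p=p)                # sample an action from pi_theta
    r = mean_reward[a] + 0.1 * rng.standard_normal()
    # Gradient of log softmax(theta)[a] w.r.t. theta is e_a - p.
    grad_log = -p
    grad_log[a] += 1.0
    theta += 0.1 * grad_log * (r - baseline)   # REINFORCE step with a baseline
    baseline += 0.05 * (r - baseline)          # running-average reward baseline

# After training, the policy concentrates on the highest-reward action (index 1).
```

The running-average baseline plays the role of a variance reducer; without it, the raw REINFORCE update is unbiased but noisier.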
3 Related work
In the context of RL, two works [20, 21] developed optimizers based on policy gradient methods for VQA optimization, highlighting the robustness of RL-based techniques against off-the-shelf optimizers in the presence of noise. As opposed to our work, both these works use an external RL policy to choose the angles of QAOA in a one-step Markov Decision Process (MDP) environment, and otherwise rely on the basic QAOA algorithm. A series of works [22, 23] have also used RL-based optimization to generalize the approach of QAOA for preparing the ground state of quantum many-body problems. In [22], an agent uses an autoregression mechanism to sample the gate unitaries in a one-step MDP and employs an off-the-shelf optimizer to optimize angles to prepare a generalized QAOA ansatz. The same set of authors then unify their previous works [20, 22] with both the use of a generalized autoregressive architecture that incorporates the parameters of the continuous policy and an extended variant of Proximal Policy Optimization (PPO) applicable to hybrid continuous-discrete policies [24]. We note that in all the works [20, 22–24], the quantum circuit (QAOA-type ansatz) is a part of the environment. In our case, we focus on employing reinforcement learning to enhance the performance of RQAOA, inspired by a recent work [25] on using quantum circuits to design RL policies. In contrast to the approaches discussed above, we design an RL policy based on the QAOA ansatz in a multi-step MDP environment where the quantum circuit (QAOA ansatz) is not a part of the environment. Other works have used Q-learning to formulate QAOA into an RL framework to solve difficult combinatorial problems [26] and in the context of digital quantum simulation [27].
In the context of employing nonlocal post-processing methods in quantum optimization algorithms akin to classical iterated rounding, there have been a few proposals to modify RQAOA. The main idea behind RQAOA is to use QAOA iteratively to compute correlations and then, at every iteration, employ a rounding (variable elimination) procedure to reduce the size of the problem by one. The variants of RQAOA proposed in the literature primarily differ in how the correlations are computed and how the variables are eliminated. For instance, in [11, 16], the variable elimination scheme of RQAOA is deterministic and relies on correlations between qubits (qudits). On the other hand, the authors in [18, Sec. V.A.] propose a modified RQAOA where the rounding procedure is stochastic (controlled by a fixed hyperparameter β), and a variable is eliminated based on individual spin polarizations. In contrast, our proposal of RL-RQAOA trains analogous parameter(s) β⃗ via RL (see Appendix C) and uses correlations between qubits to perform variable elimination.
Note added
Several preprints on iterative/recursive quantum optimization algorithms generalizing RQAOA have appeared since the submission of this work on arXiv. Parallel works such as [28–30] widen the selection and variable elimination schemes within the framework of recursive quantum optimization in application to constrained problems such as Maximum Independent Set (MIS) and Max-2-SAT. Moreover, the authors of [28] give theoretical justification for why depth-1 QAOA might not be a suitable candidate for quantum advantage and consequently urge the community to explore higher-depth alternatives.
4 Limitations of RQAOA
This section highlights some algorithmic limitations of RQAOA by introducing an alternative perspective on it. Then, based on this perspective, we provide insights into when RQAOA might fail and why. It is obvious that (depth-1) RQAOA must fail on some instances, since we assume \(\mathsf{BPP} \subsetneq \mathsf{NP}\),^{Footnote 4} but these instances may a priori be quite large. By “failure”, we mean that RQAOA cannot find an optimal (exact) solution. Notably, even if depth-l RQAOA fails to find exact solutions, it could still achieve an approximation ratio better than the bound known from inapproximability theory; in that case, \(\mathsf{NP} \subseteq \mathsf{BQP}\) would follow. For instance, if RQAOA fails to find an exact solution but still achieves an approximation ratio of \(16/17 + \epsilon \) or \(0.8785 + \epsilon \), then \(\mathsf{NP} \subseteq \mathsf{BQP}\) follows from [31, 32] under different complexity-theoretic assumptions, thus demonstrating quantum advantage. We primarily focus on finding small-size instances since we need a data set of small instances to be able to efficiently compare the performance of (depth-1) RQAOA and RL-RQAOA computationally.
First, let us motivate the use of QAOA as a subroutine in RQAOA; in other words, why would one optimize depth-l QAOA (i.e., find energy-optimal parameters for a Hamiltonian) and then use it in a completely different way (i.e., perform variable elimination by computing two-correlation coefficients \(M_{u,v}\))? Intuitively, using QAOA in such a fashion makes sense because, as the depth \(l \rightarrow \infty \), the output of QAOA converges to the quantum state which is the uniform superposition over all optimal solutions; hence, for each pair \((u,v) \in E\), computing the coefficient \(M_{u,v}\) exactly predicts whether the edge is correlated (vertices with the same sign, i.e., lying in the same partition) or anticorrelated (vertices with different signs, i.e., lying in different partitions) in an optimal cut. The next piece of intuition, which is not any kind of formal argument, is that low-depth QAOA prepares a superposition state in which low-energy states are likely to have high probability amplitudes. RQAOA then selects the edge which is most correlated or anticorrelated in these low-energy states. Furthermore, assuming that an ensemble of reasonable solutions often agree on which edges to keep and which ones to cut, RQAOA will select good edges to cut or keep (from the MaxCut perspective). However, we also expect RQAOA to fail sometimes, for instance, when the intuition mentioned above is wrong, or when it assigns a wrong edge-correlation sign to an edge for other reasons. Hence, when RQAOA fails, this raises the question of whether there are better angles for selecting an edge and its correct edge-correlation sign at every iteration than the energy-optimal angles (see Fig. 2).
RQAOA can alternatively be visualized as performing a tree search to find the most probable spin configuration close to the ground state of the Ising problem. In particular, at the k-th level of the tree, nodes correspond to graphs with \(n-k\) vertices, each having different edge sets. Suppose that a node has \(n-k\) vertices with e edges; then it will have e children, where each child corresponds to a graph with \(n-k-1\) vertices having a different edge set following the edge contraction rules obtained by imposing (6). The original RQAOA proposal [11] is a randomized algorithm (in the sense that ties between maximal two-correlation coefficients are broken uniformly at random) on this tree, exploring only a single path during one run and terminating at the \((n-n_{c})\)-th level. The decision of choosing an appropriate branch is made based on the largest magnitude of the two-correlation coefficients \(M_{u, v}\) computed via a depth-l QAOA using \(H_{n}\)-energy-optimal parameters. While exploring level-by-level, RQAOA assigns the edge correlations (−1 or +1) when a vertex is eliminated according to the constraint (6). We note that in the case of ties between maximal two-correlation coefficients, independent runs of RQAOA might not necessarily induce the same search tree.
This alternative perspective provides some insights regarding the limitations of RQAOA: (i) when there are ties and branching occurs, it could be that only one path within the set of induced search trees leads to a good approximate solution; and (ii) it may be the case that even when there are no ties (i.e., one path and no branching), selecting edges to contract according to the maximal correlation coefficient stemming from energy-optimal parameters of QAOA is an incorrect choice for attaining a good solution. A priori, it is not obvious whether either of these two possibilities can occur under the choice of energy-optimal angles. However, note that one of (i), (ii), or a combination of both must happen; otherwise, RQAOA would be an efficient polynomial-time algorithm for the Ising problem. Hence, in the case that RQAOA makes an incorrect choice, it lacks the ability to explore the search tree to find better approximate solutions. Keeping these considerations in mind, we will show later that both phenomena (i) and (ii) occur by performing an empirical analysis of RQAOA. We now describe both limitations in detail below.

(i)
It may be the case that eliminating a variable by taking the argmax of the absolute value of the two-correlation coefficients is always a correct choice, but that there can be more than one such choice at every iteration. Moreover, it is possible to construct instances with a small number of optimal solutions where, for the majority of the \(n-n_{c}\) iterations (corresponding to the levels of the tree), there is at least one tie (here, m ties correspond to \(m+1\) pairs \((u_{1}, v_{1}), \ldots , (u_{m+1}, v_{m+1})\) with the same two-correlation coefficient). In other words, the number of times RQAOA needs to traverse the search tree in the worst case to reach the ground state (optimum) may be exponentially large; i.e., every argmax tie-break leads to a new branching of the potential choices of RQAOA, and this happens at each level of the tree. We showcase this phenomenon in our empirical analysis for one such family of instances (see Fig. 3). One may imagine perturbing the edge weights to avoid ties while preserving the ground states of the Hamiltonian, but no such perturbation is generally known.

(ii)
It may be the case that the path to reach the ground state requires the selection of a pair \((u,v)\) (and its correlation sign) for which the two-correlation coefficient is not maximal according to QAOA at energy-optimal parameters (see Fig. 2). This implies that RQAOA might prematurely lock itself out of optimal solutions.
We provide examples of graphs to prove the validity of the observations above. In the regime where there are ties between maximal correlation coefficients [(i)], we performed 200 independent RQAOA runs for the family of weighted \((d,g)\)-cage graphs^{Footnote 5} (\(3 \leq d \leq 7\); \(5 \leq g \leq 12\); edge weights \(\{-1, +1\}\)), where ties are broken uniformly at random for the \(n-n_{c}\) iterations (levels of the tree). We work with these graphs because the subgraphs that (depth-1) QAOA sees are regular trees (for most edges at every iteration of RQAOA, QAOA will see a \((d-1)\)-ary tree, as cage graphs are d-regular, which creates the situation of ties between correlation coefficients). Here, by seeing we refer to the fact that the output of depth-l QAOA for a qubit (vertex) only depends on the neighbourhood of qubits that are within distance l of the given qubit [13]. For these graphs, we found that in \(86.4 \pm 9.63\%\) of the \(n-n_{c}\) iterations, the variable to eliminate was chosen from the ties between maximal correlation coefficients (see Fig. 3).
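To make phenomenon (i) quantitative: under uniform tie-breaking, the number of distinct root-to-leaf paths RQAOA can take is lower-bounded by the product of the tie multiplicities over the iterations. The following sketch uses made-up correlation values purely to illustrate the counting.

```python
from math import prod

def tie_branching(corrs_per_iter, tol=1e-9):
    """For each iteration, count how many edges attain the maximal |M_{u,v}|
    (up to `tol`), and lower-bound the number of distinct RQAOA search-tree
    paths by the product of these tie multiplicities."""
    counts = []
    for corrs in corrs_per_iter:
        m = max(abs(c) for c in corrs)
        counts.append(sum(abs(c) >= m - tol for c in corrs))
    return counts, prod(counts)

# Three iterations with 2-, 3- and 1-fold maximal ties: >= 6 possible paths.
counts, n_paths = tie_branching([[0.7, -0.7, 0.1], [0.5, 0.5, 0.5], [0.9, 0.2]])
# counts == [2, 3, 1], n_paths == 6
```

When a constant fraction of the \(n - n_{c}\) iterations has even a single tie, as observed for the cage graphs above, this product grows exponentially in n.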
To investigate the scenario of [(ii)], we focus on a particular case where there are no ties (or comparatively few ties) and find instances such that taking the maximal two-correlation coefficient does not reach the optimal solution in the tree. For this, we performed a random search over an ensemble of 10600 weighted random d-regular graphs and found several small-size instances (#nodes ≤30) for which RQAOA did not attain the optimum.
Using both the theoretical and numerical observations discussed above, we create a dataset of graphs (containing both hard and random instances for RQAOA) for our later analysis. In the next section, we develop our new algorithm (RL-RQAOA) and compare its performance to RQAOA to assess the benefit of employing reinforcement learning in the context of recursive quantum optimization, specifically for hard instances. Finally, we give the relevant details about the data set of the graph ensemble considered in Sect. 6.1.
5 Reinforcement learning enhanced RQAOA & classical brute force policy
Having introduced the background of policy gradient methods and the limitations of RQAOA, we now develop a QAOA-inspired policy which selects a branch in the search tree (i.e., eliminates a variable) at every iteration of RL-RQAOA. Recall that, even though selecting an edge to contract according to the maximal two-correlation coefficient is often a good choice, it is not always the optimal one, and often there is no single best option but several (for instance, see Fig. 2). Our basic idea is to train an RL method to learn how to correctly select the edges to contract (along with their edge-correlation signs) while using the information generated by QAOA. Additionally, to investigate the power of the quantum circuit within the quantum-classical arrangement of RL-RQAOA, we design a classical analogue of RL-RQAOA called reinforcement-learning recursive ONE (RL-RONE) and compare it with RL-RQAOA.
To overcome the limitations of RQAOA, one needs to carefully tweak (a) RQAOA’s variable elimination subroutine and (b) the use of QAOA as a subroutine; i.e., instead of finding energy-optimal parameters, we learn the parameters of QAOA. For (a), we apply the nonlinear activation function \(\mathsf{softmax}_{\vec{\beta}}\) (see Def. 1) to the absolute values of the two-correlation coefficients \(M_{u,v}\) measured on \({ \vert {\Psi _{l}(\vec{\alpha }, \vec{\gamma })} \rangle }\). By doing this, the process of selecting a variable to eliminate (and its sign) is represented by a smooth approximation of argmax that is controlled by a vector of trainable inverse temperature parameters β⃗ (one β per edge). The parameters β⃗ (initialized at low values) are then trained such that the probability of selecting an edge (or a branch at every iteration) with the highest expected reward tends to 1. In the case of (b), we train the variational angles of QAOA in the course of learning rather than using the ones that give optimal energy. We do this for the following two reasons: (i) to avoid costly optimization;^{Footnote 6} (ii) different angle choices can sometimes help the algorithm to choose optimal paths in the search tree that are not accessible otherwise (see Fig. 2). We note that the entire learning happens on one instance of the Ising problem. Even though it is conceivable to train the algorithm over an ensemble of instances by introducing suitable generalization mechanisms such that β⃗ depends on the instance, we solely focus on learning the parameters of the policy of RL-RQAOA on one instance so that it eventually performs better than RQAOA.
To provide further details on the effective Markov Decision Process (MDP) that the above-described policy explores, note that the RQAOA method can be interpreted as a multi-step (also called an n-step) MDP environment (with a delayed reward and a non-trainable policy), where at every iteration, a variable is eliminated based on the information generated by QAOA. Let us now cast the learning problem of variable elimination in the RL framework, inspired by recent work [25] on using quantum circuits to design RL policies. For every step of the episode,^{Footnote 7} our RL agent is required to choose one action out of a discrete space equivalent to the edge set of the underlying graph; i.e., in the worst case, it selects one edge out of \(\binom{n}{2}\), on which it imposes a constraint of the form (6). Hence, the state space \(\mathcal{S}\) consists of weighted graphs (which we could encounter during an RQAOA run) and the action space \(\mathcal{A}\) consists of edges (and ±1 edge-correlations to impose on them). The actions are selected using a parameterized policy \(\pi _{\theta}(a \mid s)\) which is based on the QAOA ansatz. Since we use the expectation value of the Hamiltonian \(H_{n}\) of the Ising problem as an objective function, the reward space is \(\mathcal{R} = [0, \max_{x \in \{0,1\}^{n}} { \langle{x} \vert }H_{n}{ \vert {x} \rangle }]\).
Next, we formally define the policy of RL-RQAOA and its learning algorithm, which is a crucial part of RL-RQAOA.
Definition 1
(Policy of RLRQAOA)
Given a depth-l QAOA ansatz acting on n qubits, defined by a Hamiltonian \(H_{n}\) (with an underlying graph \(G_{n} = (V,E)\)) and variational parameters \(\{{\vec{\alpha}}, \vec{{\gamma}}\} \in [0, 2\pi ]^{2l}\), let \(M_{u, v} = { \langle{\Psi _{l}(\vec{{\alpha }}, \vec{{\gamma }})} \vert } Z_{u} Z_{v} { \vert {\Psi _{l}(\vec{{\alpha }}, \vec{{\gamma }})} \rangle }\) be the two-correlations that it generates. We define the policy of RL-RQAOA as

\[ \pi _{\theta}\bigl(a = (u,v) \mid s = G_{n}\bigr) = \frac{\exp (\beta _{u,v} \vert M_{u,v} \vert )}{\sum_{(u',v') \in E} \exp (\beta _{u',v'} \vert M_{u',v'} \vert )}, \]
where actions a correspond to edges \((u,v) \in E(G_{n})\), states s to graphs \(G_{n}\), and \(\beta _{u,v} \in \mathbb{R}\) (one for every possible edge) is an inverse temperature parameter. Here, \(\theta = (\vec{\alpha}, \vec{\gamma}, \vec{\beta})\) constitutes all trainable parameters, where \(\vec{\beta} \in \mathbb{R}^{(n^{2} - n)/2}\).
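As an illustration, the policy of Def. 1 can be sampled with a few lines of NumPy. The function below is a sketch under our own naming, with the two-correlations supplied externally (e.g., from a depth-1 QAOA simulation).

```python
import numpy as np

def rl_rqaoa_policy(correlations, betas, rng):
    """Sample an edge to eliminate, following the softmax policy of Def. 1.

    correlations : dict mapping edge (u, v) -> two-correlation M_{u,v}
    betas        : dict mapping edge (u, v) -> inverse temperature beta_{u,v}
    Returns the chosen edge, the sign to impose on it, and its probability.
    """
    edges = list(correlations)
    # Softmax over beta-weighted |M_{u,v}|: a smooth, trainable argmax.
    logits = np.array([betas[e] * abs(correlations[e]) for e in edges])
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = rng.choice(len(edges), p=probs)
    u, v = edges[idx]
    # The sign of M_{u,v} selects the constraint Z_u = sign * Z_v.
    sign = int(np.sign(correlations[(u, v)]) or 1)
    return (u, v), sign, probs[idx]
```

For large β⃗ the softmax concentrates on the edge with maximal \(|M_{u,v}|\), recovering RQAOA's greedy elimination rule.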
The reader is referred to Alg. 2 for the pseudocode of RL-RQAOA (for one episode), where the added RL components are highlighted in green. Furthermore, we note that RL-RQAOA is a generalized version of RQAOA: the former is exactly equivalent to the latter when the energy-optimal parameters \(\{\vec{\alpha}, \vec{\gamma}\}\) found by QAOA on \(H_{n}\) are used and \(\beta _{u,v} = \infty \) for all \((u, v) \in E\).
Since the vector β⃗ is edge-specific and we learn \(\beta _{u,v}\) for every \(\{u,v\} \in E\) separately for each instance, we also develop a fully classical RL algorithm, namely RL-RONE, which simply learns \(\beta _{u,v}\) for all edges directly, irrespective of where the two-correlation coefficients \(M_{u,v}\) come from. It is natural to consider this baseline because, in the hybrid quantum-classical arrangement of RL-RQAOA, the classical part (learning \(\beta _{u,v}\) for \(\{u,v\} \in E\)) might be more powerful than the quantum part (computing the two-correlations \(M_{u, v}\) from the QAOA ansatz at given variational angles \(\{{\vec{\alpha}}, \vec{{\gamma}}\}\)). Hence, in order to assess the contribution of the quantum circuit in RL-RQAOA, we define the policy of RL-RONE such that for each edge we fix the two-correlation \(M_{u, v} = 1\); i.e., we do not use any output from the quantum circuit. However, if we simply set \(M_{u, v} = 1\) in Def. 1, the policy will always assign the selected edge to be correlated, rendering it less expressive. A solution to this problem is to simultaneously learn the parameters \(\beta _{u,v}^{+1}\) (correlated edge) and \(\beta _{u,v}^{-1}\) (anti-correlated edge) for each edge. The resulting RL-RONE algorithm is then expressive enough to reach the optimal solution, and it has \(n^{2} - n\) trainable inverse temperature parameters for n the number of nodes of the graph \(G_{n}\). The notion of an action differs slightly from the RL-RQAOA policy: here an action corresponds to selecting an edge along with its sign (+1 and −1 for correlated and anti-correlated edges, respectively), while in RL-RQAOA, the two-correlation coefficient implicitly selects this sign. We formally define the policy of RL-RONE below.
Definition 2
(Policy of RL-RONE)
Given a Hamiltonian \(H_{n}\) (with an underlying graph \(G_{n} = (V,E)\)), we define the policy of RL-RONE as

\[ \pi _{\theta}\bigl(a = \bigl((u,v), b\bigr) \mid s\bigr) = \frac{e^{\beta _{u,v}^{b}}}{\sum_{(u',v') \in E} \sum_{b' \in \{\pm 1\}} e^{\beta _{u',v'}^{b'}}}, \]
where actions a correspond to edges \((u,v) \in E(G_{n})\) together with an edge correlation \(b \in \{\pm 1\}\), states s correspond to graphs \(G_{n}\), and \(\beta _{u,v}^{\pm 1} \in \mathbb{R}\) (one pair for every edge) are inverse temperature parameters. Here, \(\theta = (\vec{\beta}^{+1}, \vec{\beta}^{-1})\) constitutes all trainable parameters, where \(\vec{\beta}^{\pm 1} \in \mathbb{R}^{(n^{2} - n)/2}\).
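A minimal sketch of this classical policy (with our own naming): the logits are just the trainable β's, one pair per edge, and no quantum input enters the decision.

```python
import numpy as np

def rl_rone_policy(betas, rng):
    """Sample an action (edge, sign) under the policy of Def. 2.

    betas : dict mapping ((u, v), b) -> beta_{u,v}^{b}, for b in {+1, -1}.
    With M_{u,v} fixed to 1, the softmax logits are the betas themselves.
    """
    actions = list(betas)                        # each action is (edge, sign)
    logits = np.array([betas[a] for a in actions])
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = rng.choice(len(actions), p=probs)
    return actions[idx], probs[idx]
```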
The classical analogue RL-RONE can then be simulated by making the following modifications to Alg. 2: (i) replace the parameters θ of the policy by \(\theta = (\vec{\beta}^{+1}, \vec{\beta}^{-1})\); (ii) delete \(\mathsf{Lines~4}\) and 5; (iii) update \(\mathsf{Line~6}\) to use the policy of RL-RONE, so that the constraint (6) in \(\mathsf{Line~7}\) is imposed using the correlation sign \(b \in \{\pm 1\}\) output by the policy.
We train both the policies of RL-RQAOA and RL-RONE using the Monte Carlo policy gradient algorithm REINFORCE, as explained in Appendix B; see Alg. 1 for the pseudocode. The horizon (length) of an episode is \(n - n_{c}\). The value function is defined as \(V_{\pi _{\theta}}(H_{n}) = \mathbb{E}_{{\pi _{\theta}}} [\gamma ^{n-n_{c}} \cdot { \langle{x} \vert }H_{n}{ \vert {x} \rangle } ]\), where \(\gamma \in [0,1]\) is the discount factor, \(H_{n}\) is the Hamiltonian defined on n variables for a problem instance, and x is the binary bitstring defined in \(\mathsf{Line~14}\) of Alg. 2.
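To illustrate the shape of the REINFORCE update with this delayed reward, the toy sketch below updates a single shared softmax-logit vector. This is a simplification with hypothetical names: the actual Alg. 1 trains per-edge β's (and, for RL-RQAOA, the angles), and the state changes at every step.

```python
import numpy as np

def reinforce_step(logits, episode, reward, gamma, lr):
    """One REINFORCE update on the logits of a softmax policy.

    logits  : trainable parameters of one softmax (e.g. the betas)
    episode : list of action indices a_0, ..., a_{T-1} taken in the episode
    reward  : terminal energy <x|H_n|x> (the single, delayed reward)
    gamma   : discount factor, so the return is gamma**T * reward
    """
    T = len(episode)
    ret = gamma**T * reward
    grad = np.zeros_like(logits)
    for a in episode:
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Gradient of log softmax at action a: one-hot(a) - probs.
        grad[a] += 1.0
        grad -= probs
    # Gradient ascent: increase the value function V = E[return].
    return logits + lr * ret * grad
```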
In this work, we focus only on simulations of depth-1 RQAOA and RL-RQAOA. Indeed, in the particular case of depth-1 quantum circuits and Ising Hamiltonians, RQAOA can be simulated efficiently classically; see Sect. 2.2 and Appendix A. However, classical simulability is not known for Ising cost functions at depth 2 or larger [17], nor for more general cost Hamiltonians even at depth-1 (e.g., Max-k-XOR on arbitrary hypergraphs), leaving room for both quantum and RL-enhanced quantum advantage.
6 Numerical advantage of RL-RQAOA over RQAOA and RL-RONE
In the previous section, we introduced our quantum(-inspired) RL-enhanced RQAOA (RL-RQAOA) and its entirely classical analogue (RL-RONE), together with the design choices behind their policies. Although we gave justifications for these choices, it is natural to evaluate their influence on performance. In this section, we first describe how we found hard instances for RQAOA and discuss their properties. We then describe the results of our numerical simulations, where we consider both hard instances and random instances to benchmark the performance of (depth-1) RQAOA, RL-RQAOA, and RL-RONE. The reader is referred to Appendix C for the implementation details of these algorithms.
6.1 Hard instances for RQAOA
Here, our focus is on finding small hard instances (with the approximation ratio as a metric) of the Ising problem on which RQAOA fails. Note that RQAOA must fail to solve some instances exactly; if it did not, then \(\mathsf{NP} \subseteq \mathsf{BQP}\), as the Ising problem is NP-hard in general. As we lack techniques to analyze the performance guarantees of RQAOA at arbitrary depth l, apart from special cases like the “ring of disagrees” at depth-1 [11], finding hard instances for RQAOA is a non-trivial task. In this spirit, we generate an ensemble \(\mathcal{G}[n, d, w]\) of weighted random d-regular instances with n vertices and edge weight distribution \(w: E \rightarrow \mathbb{R}\), and perform a random search over \(\mathcal{G}[n, d, w]\) to find hard instances. Concretely, we construct the graph ensemble \(\mathcal{G}[n,d,w]\) as follows: for each tuple of parameters \((n, d, w) \in \{14, 15, \ldots , 30\} \times \{3, 4, \ldots , 29\} \times \{\mathrm{Gaussian}, \mathrm{bimodal}\}\), we generate 25 graphs whenever possible,^{Footnote 8} yielding 10600 graphs in total, where Gaussian \((\mathcal{N}(0,1))\) and bimodal \((\{\pm 1\})\) are the edge weight distributions. Intuitively, the instances with bimodal edge weights should have a high level of degeneracy among the ground states, which is confirmed by our simulations. Moreover, for the instances with bimodal edge weights, where ties between two-correlation coefficients were encountered, the final approximation ratio was computed based on the best energy attained by running RQAOA for a maximum of 1400 independent runs. On the other hand, for the instances with Gaussian edge weights \(\mathcal{N}(0,1)\), we found that all instances had unique ground states; hence, we ran RQAOA only once to get the best approximation ratio for these instances.
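The ensemble construction can be sketched as follows. The graph generator itself (e.g., networkx's random_regular_graph) is omitted; shown are only the validity condition of footnote 8 and the two edge-weight distributions, under our own naming.

```python
import numpy as np

def ensemble_params(ns, ds):
    """Valid (n, d) pairs for d-regular graphs (footnote 8's condition):
    d <= n - 1 and, since the number of edge endpoints is n*d, n*d even."""
    return [(n, d) for n in ns for d in ds
            if d <= n - 1 and (n * d) % 2 == 0]

def sample_weights(n_edges, dist, rng):
    """Edge weights for one instance: Gaussian N(0,1) or bimodal {-1, +1}."""
    if dist == "gaussian":
        return rng.standard_normal(n_edges)
    return rng.choice([-1.0, 1.0], size=n_edges)
```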
We filter out 1027 instances (857 with bimodal weights and 170 with Gaussian weights) for which RQAOA's approximation ratio is less than 0.95. Note that RQAOA can only get closer to optimal as \(n_{c}\) grows; in other words, the solution quality improves monotonically with increasing \(n_{c}\). Since we want to improve upon RQAOA in its strongest regime, we choose \(n_{c} = 8\) (unless specified otherwise) for our numerical simulations. Interestingly, however, for the 1027 hard instances found above, even with \(n_{c} = 4\) we only found 26 instances (5 with bimodal weights and 21 with Gaussian weights) for which the approximation ratio decreased (for the rest, it remained the same). We chose \(n_{c}=4\) for this experiment because, for some instances, the edge weights cancelled out after an edge contraction subroutine, and as a consequence, the intermediate graph ended up being an empty graph (a graph with zero edge weights) for \(1 \leq n_{c} < 4\).
6.2 Benchmarking
6.2.1 RL-RQAOA vs RQAOA on cage graphs
In our first set of experiments, illustrated in Fig. 4, we compare the performance of RL-RQAOA with RQAOA on random Ising instances derived from \((d,g)\)-cage graphs (\(3 \leq d \leq 7\); \(5 \leq g \leq 12\); edge weights \(\{-1, +1\}\)). The aim of this experiment is twofold: first, to show that RL-RQAOA does not perform much worse than RQAOA on instances where the latter performs quite well; second, to test the advantage of RL-RQAOA over RQAOA in terms of the probability of attaining the optimal solution when there are many ties between two-correlation coefficients \(M_{u,v}\) at every iteration. Notably, we already demonstrated earlier (see Fig. 3) that for cage graphs, RQAOA has a constant number of ties between maximal two-correlation coefficients for the majority of the \(n - n_{c}\) iterations. To assess our hypotheses, we evaluate the average learning performance over 15 independent RL-RQAOA runs of 1400 episodes each. In order to compare RL-RQAOA with RQAOA fairly, we run RQAOA independently 1400 times and choose the best solution from these runs. Note that this is a more powerful heuristic than vanilla RQAOA (which outputs the first solution it finds), with the number of independent runs acting as a hyperparameter that controls the solution quality. Both RL-RQAOA (vote variant) and RQAOA fail to reach the optimum for the \((3, 12)\)-cage graph within the given budget (see Fig. 4). However, the resulting learning curves of RL-RQAOA confirm both our hypotheses for the majority of the instances.
6.2.2 RQAOA vs RL-RQAOA on hard instances
The next set of experiments, presented in Fig. 5, is similar in flavour to the previous one, but aims to show a separation between RL-RQAOA and RQAOA on the hard instances found in Sect. 6.1. More specifically, we show that RL-RQAOA always performs better than RQAOA on these instances in terms of the best approximation ratio achieved, again evaluating the average learning performance over 15 independent RL-RQAOA runs. Interestingly, RL-RQAOA outperformed RQAOA even when the angles of the QAOA circuit were initialized randomly.
6.2.3 RL-RQAOA vs RL-RONE
The results in the previous two subsections do not, however, indicate the importance of the quantum part in the quantum-classical arrangement. To address this, we performed a third set of experiments, presented in Fig. 6, where both algorithms were tested on random 3-regular graphs of 100 and 200 nodes. Comparing the performance of RL-RONE with RL-RQAOA, we see a clear separation between the learning curves of the two agents, highlighting the effectiveness of the quantum circuits in solving the Ising problem.
7 Discussion
In this work, we analyzed the bottlenecks of a non-local variant of QAOA, namely recursive QAOA (RQAOA), and based on this analysis, proposed a novel algorithm that uses reinforcement learning (RL) to enhance the performance of RQAOA (RL-RQAOA). In the process of analyzing the bottlenecks of RQAOA for the Ising problem, we found small hard Ising instances in an ensemble of random weighted d-regular graphs. To avoid missing better solutions at any iteration, we cast the variable elimination problem within RQAOA in a reinforcement learning framework; we introduced a quantum(-inspired) policy for RL-RQAOA, which controls the trade-off between the exploitative and exploratory behaviour of RL-RQAOA. We demonstrated via numerical simulations that formulating RQAOA in the RL framework boosts performance: RL-RQAOA performs as well as RQAOA on random instances and beats RQAOA on all hard instances we identified. Finally, we note that all the numerical simulations for RQAOA (depth-1) and the proposed hybrid algorithm RL-RQAOA (depth-1) were performed classically, and no quantum advantage is to be expected unless both are simulated at higher depths. An interesting follow-up to this work would be to assess the performance of both RQAOA and RL-RQAOA at higher depths on an actual quantum processing unit (QPU), in both noisy and noise-free regimes.
Data availability
The datasets and the code used and/or analysed during the current study are available at https://github.com/Zakuta/RLRQAOApapercode/.
Notes
1. The time complexity of these auxiliary steps should be polynomial in the input size for the algorithm to remain practically viable.
2. The textbook Ising problem definition has a negative sign \((-)\), and the goal is to minimize the Hamiltonian.
3. The bitstring \(\{0,1\}^{n}\) is analogous to the spin configuration \(\{-1,+1\}^{n}\), where 0 corresponds to −1 and 1 to +1. Hereafter, we use both interchangeably.
4. Here, we use the complexity-theoretic assumption \(\mathsf{BPP} \subsetneq \mathsf{NP}\) because depth-1 RQAOA can be simulated classically.
5. A \((d,g)\)-cage graph (\(d \geq 3\), \(g \geq 5\)) is a d-regular graph of girth g (the length of a shortest cycle contained in the graph) with the smallest possible number of vertices.
6. We train the QAOA angles at depth-1 even though we can optimize them efficiently (see Appendix D). However, the optimization becomes non-trivial with increasing depth.
7. Here, one episode corresponds to one complete run of (RL-)RQAOA.
8. For generating d-regular graphs with n vertices, \(1 \leq d \leq n-1\), and further, if d is odd, n must be even.
9. Note that, due to the external fields \(h_{u}\), the adjacency matrix A will typically also have nonzero elements along its diagonal.
10. This means we have a precision of 10^{−3}. One could in principle aim for higher precision with a finer grid, but our numerical simulations showed that a grid size of \(N=2000\) was sufficient.
Abbreviations
NISQ: Noisy Intermediate Scalable Quantum
VQA: Variational Quantum Algorithm
NP: Nondeterministic Polynomial time
BQP: Bounded-error Quantum Polynomial time
BPP: Bounded-error Probabilistic Polynomial time
QAOA: Quantum Approximate Optimization Algorithm
RQAOA: Recursive Quantum Approximate Optimization Algorithm
DNN: Deep Neural Network
RL: Reinforcement Learning
MDP: Markov Decision Process
RL-RQAOA: Reinforcement Learning enhanced Recursive Quantum Approximate Optimization Algorithm
RL-RONE: Reinforcement Learning enhanced Recursive ONE (classical)
CPU: Central Processing Unit
QPU: Quantum Processing Unit
COBYLA: Constrained Optimization By Linear Approximation
PG: Policy Gradient
PPO: Proximal Policy Optimization
References
Google AI Quantum and Collaborators, Arute F, Arya K, Babbush R, Bacon D, Bardin JC, Barends R, Boixo S, Broughton M, Buckley BB et al.. HartreeFock on a superconducting qubit quantum computer. Science. 2020;369(6507):1084–9. https://doi.org/10.1126/science.abb9811.
Jurcevic P, JavadiAbhari A, Bishop LS, Lauer I, Bogorin DF, Brink M, Capelluto L, Günlük O, Itoko T, Kanazawa N, Kandala A, Keefe GA, Krsulich K, Landers W, Lewandowski EP, McClure DT, Nannicini G, Narasgond A, Nayfeh HM, Pritchett E, Rothwell MB, Srinivasan S, Sundaresan N, Wang C, Wei KX, Wood CJ, Yau JB, Zhang EJ, Dial OE, Chow JM, Gambetta JM. Demonstration of quantum volume 64 on a superconducting quantum computing system. Quantum Sci Technol. 2021;6(2):025020. https://doi.org/10.1088/20589565/abe519.
Ebadi S, Wang TT, Levine H, Keesling A, Semeghini G, Omran A, Bluvstein D, Samajdar R, Pichler H, Ho WW, Choi S, Sachdev S, Greiner M, Vuletić V, Lukin MD. Quantum phases of matter on a 256atom programmable quantum simulator. Nature. 2021;595(7866):227–32. https://doi.org/10.1038/s41586021035824.
Gong M, Wang S, Zha C, Chen MC, Huang HL, Wu Y, Zhu Q, Zhao Y, Li S, Guo S, Qian H, Ye Y, Chen F, Ying C, Yu J, Fan D, Wu D, Su H, Deng H, Rong H, Zhang K, Cao S, Lin J, Xu Y, Sun L, Guo C, Li N, Liang F, Bastidas VM, Nemoto K, Munro WJ, Huo YH, Lu CY, Peng CZ, Zhu X, Pan JW. Quantum walks on a programmable twodimensional 62qubit superconducting processor. Science. 2021;372(6545):948–52. https://doi.org/10.1126/science.abg7812.
Moll N, Barkoutsos P, Bishop LS, Chow JM, Cross A, Egger DJ, Filipp S, Fuhrer A, Gambetta JM, Ganzhorn M, Kandala A, Mezzacapo A, Müller P, Riess W, Salis G, Smolin J, Tavernelli I, Temme K. Quantum optimization using variational algorithms on nearterm quantum devices. Quantum Sci Technol. 2018;3(3):030503. https://doi.org/10.1088/20589565/aab822.
Benedetti M, Lloyd E, Sack S, Fiorentini M. Parameterized quantum circuits as machine learning models. Quantum Sci Technol. 2019;4(4):043001. https://doi.org/10.1088/20589565/ab4eb5.
Farhi E, Goldstone J, Gutmann S. A quantum approximate optimization algorithm. 2014. arXiv preprint. arXiv:1411.4028.
Hastings MB. Classical and quantum bounded depth approximation algorithms. 2019. arXiv preprint. arXiv:1905.07047.
Marwaha K. Local classical MAX-CUT algorithm outperforms \(p=2\) QAOA on high-girth regular graphs. Quantum. 2021;5:437. https://doi.org/10.22331/q-2021-04-20-437.
Barak B, Marwaha K. Classical algorithms and quantum limitations for maximum cut on highgirth graphs. 2022. https://doi.org/10.4230/LIPICS.ITCS.2022.14.
Bravyi S, Kliesch A, Koenig R, Tang E. Obstacles to variational quantum optimization from symmetry protection. Phys Rev Lett. 2020;125(26). https://doi.org/10.1103/physrevlett.125.260505.
Farhi E, Gamarnik D, Gutmann S. The quantum approximate optimization algorithm needs to see the whole graph: a typical case. 2020. arXiv preprint. arXiv:2004.09002.
Farhi E, Gamarnik D, Gutmann S. The quantum approximate optimization algorithm needs to see the whole graph: worst case examples. 2020. arXiv preprint. arXiv:2005.08747.
Chou C-N, Love PJ, Sandhu JS, Shi J. Limitations of local quantum algorithms on random Max-k-XOR and beyond. 2021. arXiv preprint. arXiv:2108.06049.
Marwaha K, Hadfield S. Bounds on approximating Max k-XOR with quantum and classical local algorithms. 2021. arXiv preprint. arXiv:2109.10833.
Bravyi S, Kliesch A, Koenig R, Tang E. Hybrid quantumclassical algorithms for approximate graph colouring. Quantum. 2022;6:678. https://doi.org/10.22331/q20220330678.
Bravyi S, Gosset D, Grier D. Classical algorithms for forrelation. 2021. arXiv preprint. arXiv:2102.06963.
McClean JR, Harrigan MP, Mohseni M, Rubin NC, Jiang Z, Boixo S, Smelyanskiy VN, Babbush R, Neven H. Lowdepth mechanisms for quantum optimization. PRX Quantum. 2021;2(3). https://doi.org/10.1103/prxquantum.2.030312.
Sutton RS, Barto AG. Reinforcement learning: an introduction. 2018.
Yao J, Bukov M, Lin L. Policy gradient based quantum approximate optimization algorithm. In: Mathematical and scientific machine learning. 2020. p. 605–34. PMLR.
Sung KJ, Yao J, Harrigan MP, Rubin NC, Jiang Z, Lin L, Babbush R, McClean JR. Using models to improve optimizers for variational quantum algorithms. Quantum Sci Technol. 2020;5(4):044008. https://doi.org/10.1088/20589565/abb6d9.
Yao J, Lin L, Bukov M. Reinforcement learning for manybody groundstate preparation inspired by counterdiabatic driving. Phys Rev X. 2021;11(3). https://doi.org/10.1103/physrevx.11.031070.
Yao J, Lin L, Bukov M. RL-QAOA: a reinforcement learning approach to many-body ground state preparation. Bull Am Phys Soc. 2021;66.
Yao J, Kottering P, Gundlach H, Lin L, Bukov M. Noiserobust endtoend quantum control using deep autoregressive policy networks. In: Mathematical and scientific machine learning. 2022. p. 1044–81. PMLR.
Jerbi S, Gyurik C, Marshall S, Briegel H, Dunjko V. Parametrized quantum policies for reinforcement learning. Adv Neural Inf Process Syst. 2021;34.
Wauters MM, Panizon E, Mbeng GB, Santoro GE. Reinforcementlearningassisted quantum optimization. Phys Rev Res. 2020;2(3). https://doi.org/10.1103/physrevresearch.2.033446.
Khairy S, Shaydulin R, Cincio L, Alexeev Y, Balaprakash P. Learning to optimize variational quantum circuits to solve combinatorial problems. Proc AAAI Conf Artif Intell. 2020;34(03):2367–75. https://doi.org/10.1609/aaai.v34i03.5616.
Brady LT, Hadfield S. Iterative quantum algorithms for maximum independent set: a tale of low-depth quantum algorithms. 2023. arXiv:2309.13110.
Finžgar JR, Kerschbaumer A, Schuetz MJ, Mendl CB, Katzgraber HG. Quantuminformed recursive optimization algorithms. 2023. arXiv preprint. arXiv:2308.13607.
Dupont M, Evert B, Hodson MJ, Sundar B, Jeffrey S, Yamaguchi Y, Feng D, Maciejewski FB, Hadfield S, Alam MS et al.. Quantumenhanced greedy combinatorial optimization solver. Sci Adv. 2023;9(45):0487.
Håstad J. Some optimal inapproximability results. J ACM. 2001;48(4):798–859.
Khot S, Kindler G, Mossel E, O’Donnell R. Optimal inapproximability results for MAX-CUT and other 2-variable CSPs? SIAM J Comput. 2007;37(1):319–57.
Ozaeta A, van Dam W, McMahon PL. Expectation values from the singlelayer quantum approximate optimization algorithm on ising problems. 2021. arXiv preprint. arXiv:2012.03421.
Sutton RS, McAllester D, Singh S, Mansour Y. Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst. 1999;12.
Kakade SM. On the sample complexity of reinforcement learning. 2003.
Williams RJ. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Mach Learn. 1992;8(3–4):229–56. https://doi.org/10.1007/bf00992696.
Konda V, Tsitsiklis J. Actorcritic algorithms. Adv Neural Inf Process Syst. 1999;12.
Bittel L, Kliesch M. Training variational quantum algorithms is NP-hard. Phys Rev Lett. 2021;127(12). https://doi.org/10.1103/physrevlett.127.120502.
Brandao FG, Broughton M, Farhi E, Gutmann S, Neven H. For fixed control parameters the quantum approximate optimization algorithm’s objective function value concentrates for typical instances. 2018. arXiv preprint. arXiv:1812.04170.
Lotshaw PC, Humble TS, Herrman R, Ostrowski J, Siopsis G. Empirical performance bounds for quantum approximate optimization. Quantum Inf Process. 2021;20(12):403. https://doi.org/10.1007/s11128021033423.
Wurtz J, Lykov D. Fixedangle conjectures for the quantum approximate optimization algorithm on regular MaxCut graphs. Phys Rev A. 2021;104(5). https://doi.org/10.1103/physreva.104.052419.
Shaydulin R, Lotshaw PC, Larson J, Ostrowski J, Humble TS. Parameter transfer for quantum approximate optimization of weighted maxcut. 2022. arXiv preprint. arXiv:2201.11785.
Moussa C, Wang H, Bäck T, Dunjko V. Unsupervised strategies for identifying optimal parameters in quantum approximate optimization algorithm. EPJ Quantum Technol. 2022;9(1). https://doi.org/10.1140/epjqt/s40507022001314.
Kingma DP, Ba J. Adam: a method for stochastic optimization. 2014. arXiv preprint. arXiv:1412.6980.
Acknowledgements
YJP would like to thank Simon Marshall and Charles Moussa for useful discussions. SJ would like to thank Hans Briegel for useful discussions in the early phases of this project. The authors thank Adrián PérezSalinas, and Andrea Skolik for useful comments on an earlier version of this manuscript and Casper Gyurik for reading the final version of this manuscript. TB, VD, and YJP acknowledge support from TotalEnergies. SJ acknowledges support from the Austrian Science Fund (FWF) through the projects DKALM:W1259N27 and SFB BeyondC F7102. SJ also acknowledges the Austrian Academy of Sciences as a recipient of the DOC Fellowship. The computational results presented here have been achieved in part using the LEO HPC infrastructure of the University of Innsbruck and DSlab infrastructure of the Leiden Institute of Advanced Computer Science (LIACS) at Leiden University.
Funding
This work was in part supported by the Dutch Research Council (NWO/OCW), as part of the Quantum Software Consortium programme (project number 024.003.037). VD acknowledges the support by the project NEASQC funded from the European Union’s Horizon 2020 research and innovation programme (grant agreement No 951821). VD also acknowledges support through an unrestricted gift from Google Quantum AI.
Author information
Authors and Affiliations
Contributions
YJP and SJ contributed equally to this work. YJP, SJ, and VD designed all the experiments. The manuscript was written with contributions from all authors. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Theorem on classical simulability of Ising problem for depth1 QAOA
Theorem 1
([11, 33]) Let \(H_{n} = \sum_{u \in V} h_{u} Z_{u} + \sum_{(u,v) \in E} J_{uv} Z_{u} Z_{v}\) be an Ising cost Hamiltonian, and define \(s(x) := \sin(x)\) and \(c(x) := \cos(x)\). Then for a fixed pair of qubits \(1 \leq u \leq v \leq n\),
where,
and
Here, w.l.o.g. we assume that the underlying graph is the complete graph \(K_{n}\) and \(\gamma = 1\), since γ can be absorbed into the definition of the adjacency matrix A^{Footnote 9} of the graph.
Appendix B: Policy gradient method
This appendix provides a more detailed description of the policy gradient method used to find an optimal policy that simultaneously optimizes the variational parameters of QAOA and selects a decision variable to eliminate within RL-RQAOA.
The crux of policy gradient methods lies in (i) a parameterized policy \(\pi _{\theta}\), which drives an agent's actions in an environment, and (ii) a value function \(V_{\pi _{\theta}}\) that evaluates the long-term performance associated with the policy \(\pi _{\theta}\). Policy gradient methods employ a simple optimization approach: they start with an initial policy \(\pi _{\theta}\) and update its parameters iteratively by gradient ascent so that the associated value function \(V_{\pi _{\theta}}(s_{0})\) is maximized. This approach can be applied efficiently if one can evaluate either the value function of the policy or at least its gradient \(\nabla _{\theta}V_{\pi _{\theta}}\). In the case of policy gradient methods, the gradient \(\nabla _{\theta}V_{\pi _{\theta}}\) can be expressed analytically and estimated using Monte Carlo rollouts within an environment. We formally state this in Theorem 2.
Policy gradient Theorem
In practice, the value function can be estimated via a Monte Carlo approach: (i) collect N sample episodes τ of interactions governed by the policy \(\pi _{\theta}\) within an environment; (ii) compute the return \(R(\tau )\) of each episode as in (9); and (iii) average the results. The Monte Carlo estimate of the value function can then be written as
Theorem 2
(Policy Gradient Theorem [34]) Given an environment defined by its dynamics \(\mathbb{P}_{E}\) and a parameterized policy \(\pi _{\theta}\), the gradient of the value function defined in (9) w.r.t. θ is given by
This theorem enables an analytic estimate of the gradient of the value function, whose sample complexity scales only logarithmically in the number of policy parameters θ [35].
Moreover, one can estimate the value function terms \(V_{\pi _{\theta}}(s_{t})\) in (15) by collecting rewards from the Monte Carlo rollouts (as defined in (14)). The resulting learning algorithm is called the Monte Carlo policy gradient algorithm, better known as REINFORCE [19, 36]. In the literature, there exist more sophisticated approaches, such as the actor-critic method [37], where the value function is estimated using an additional approximator such as a deep neural network (DNN).
Appendix C: Implementation details of algorithms
This appendix provides specifications for simulating RQAOA, RL-RQAOA, RL-RONE, and the Gurobi optimizer.
RQAOA
The authors in [38] prove that finding optimal parameters in QAOA is an NP-hard problem, even with logarithmically many qubits at depth-1. We also often found that the landscape has many local extrema and saddle points, which are detrimental to gradient-based methods in QAOA and RQAOA. Hence, we use a brute-force search to optimize the variational parameters. To perform the brute-force search for depth-1 QAOA efficiently, we show in Appendix D that for any fixed value \(\gamma \in \mathbb{R}\), one can compute the \(\alpha \in \mathbb{R}\) maximizing the energy over \((\alpha , \gamma )\) by solving a system of equations as defined in (21). We thus chose 2000 equidistant grid points^{Footnote 10} \(\gamma _{1}, \ldots , \gamma _{2000}\) in the interval \([0, 2\pi ]\) for γ. After finding the grid point \(\gamma _{k}\) that maximizes the energy, we performed a further refined local optimization with the off-the-shelf optimizer COBYLA in the interval \([\gamma _{k-1}, \gamma _{k+1}]\). Finally, we note that the optimal angles of QAOA for random graphs in \(\mathcal{G}[n, d, w]\) concentrate, in line with several theoretical and empirical results in the literature [14, 39–43]. Throughout, we choose \(n_{c} = 8\) unless specified otherwise.
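The outer grid-search loop can be sketched as follows. As an assumption, the per-γ energy maximization over α is abstracted into a callable, and the COBYLA refinement is replaced by a finer grid scan over \([\gamma_{k-1}, \gamma_{k+1}]\) to keep the sketch dependency-free.

```python
import numpy as np

def grid_then_refine(energy, n_grid=2000, n_refine=200):
    """Brute-force the outer gamma loop of depth-1 QAOA optimization.

    energy : vectorized callable gamma -> max-over-alpha energy at that gamma
             (per Appendix D, alpha is assumed to be solved analytically)
    Scans n_grid equidistant points on [0, 2*pi] (precision ~1e-3), then
    re-scans a finer grid on [gamma_{k-1}, gamma_{k+1}] around the best point.
    """
    gammas = np.linspace(0.0, 2.0 * np.pi, n_grid)
    k = int(np.argmax(energy(gammas)))
    lo = gammas[max(k - 1, 0)]
    hi = gammas[min(k + 1, n_grid - 1)]
    fine = np.linspace(lo, hi, n_refine)
    return fine[int(np.argmax(energy(fine)))]
```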
RLRQAOA
Here, we discuss some design choices for RL-RQAOA, as it is not entirely clear which of them have a positive influence on the learning performance of its policy. Firstly, there are three ways to define β⃗ that all recover the RQAOA policy for large β⃗:

(i)
(β-all) \(\vec{\beta} = \{\beta \}^{(n-n_{c})}\), i.e., only one parameter shared among all edges at every iteration of RL-RQAOA;

(ii)
(β-one-all) \(\vec{\beta} = \{\beta _{u,v}\}^{(n^{2}-n)/2}\), \(\forall (u,v) \in E\), i.e., the RL agent learns a value \(\beta _{u,v}\) for each edge, amounting to a total of \(\binom{n}{2}\) β's; and

(iii)
(β-all-all) \(\vec{\beta} = \{\beta _{u,v}\}^{(n-n_{c})(n^{2}-n)/2}\), i.e., as in (ii), but the \(\beta _{u,v}\) are learnt separately for every iteration of RL-RQAOA.
Secondly, one can initialize the variational angles \(\{\alpha , \gamma \}\) randomly, with extremum points, or with optimal QAOA angles at every iteration, and then train an agent to learn the angles. In our simulations, we always warm-start the RL agent with optimal QAOA angles to better capture the power of the quantum circuits. We call this model WS-RL-RQAOA\(_{\alpha , \gamma , \beta}\), where the subscript highlights the parameters for which the agent is trained. Following this nomenclature, WS-RL-RQAOA\(_{\alpha , \gamma , \beta}\) and WS-RL-RQAOA\(_{\beta}\) are two models of RL-RQAOA, both warm-started with QAOA angles, that differ only in their learning procedure: the agent of the former trains both the variational angles and β⃗, while the latter trains only β⃗ with the optimal QAOA angles kept fixed.
Empirically, we found that an agent with the WS-RL-RQAOA\(_{\alpha , \gamma , \beta}\) model in the (β-one-all) configuration suffices to beat RQAOA (in fact, it often solves instances optimally) within the predefined budget of 1400 episodes, at least for all graphs with fewer than 30 vertices. Hence, we use the (β-one-all) choice of β⃗ in the rest of the manuscript unless specified explicitly.
Finally, we list the hyperparameters used in our simulations. We train an agent to choose both the set of variational angles and the inverse temperature constants using the policy gradient method. We set the discount factor \(\gamma = 0.99\), initialize (β-one-all) \(\vec{\beta} = \{25\}^{(n^{2}-n)/2}\), and use ADAM [44] as the optimizer with learning rates \(\{{lr}_{\mathrm{angles}}, {lr}_{\mathrm{betas}}\} = \{0.001, 0.5\}\). During our hyperparameter sweep, we noticed that higher values of the trainable parameters β⃗ hampered the learning performance of the RL agents, likely due to their inability to explore the environment. This is consistent with the fact that for \(\vec{\beta} \rightarrow \infty \), RL-RQAOA mimics the behaviour of RQAOA; in other words, RL-RQAOA is indeed a generalized variant of RQAOA.
RL-RONE
We simulated RL-RONE with the same set of hyperparameters and configuration as RL-RQAOA. The only difference is that there are no angles to be learned, as the two-point correlations are fixed to \(M_{u,v} = 1\) for all \(\{u,v\} \in E\) by design.
EXACT
The exact solutions were computed using the state-of-the-art commercial solver Gurobi 9.0 with the parameter \(\mathsf{MIPGap} = 0\), since otherwise the optimization would end prematurely and return a suboptimal solution.
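For concreteness, a minimal gurobipy sketch of such an exact solve is given below (the Ising instance, variable names, and spin encoding are illustrative assumptions, not taken from the paper; running it requires a Gurobi installation and licence):

```python
import gurobipy as gp
from gurobipy import GRB

# Illustrative Ising instance: maximize sum_{(u,v)} J_uv * s_u * s_v with s_u in {-1,+1}.
J = {(0, 1): 1.0, (1, 2): -0.5, (0, 2): 0.7}
n = 3

model = gp.Model("ising")
x = model.addVars(n, vtype=GRB.BINARY)  # encode spins as s_u = 2*x_u - 1
model.setObjective(
    gp.quicksum(w * (2 * x[u] - 1) * (2 * x[v] - 1) for (u, v), w in J.items()),
    GRB.MAXIMIZE,
)

# MIPGap = 0 demands a proven optimum; the default gap (1e-4) lets the
# solver terminate early with a merely near-optimal incumbent.
model.Params.MIPGap = 0
model.optimize()

best_spins = [int(2 * x[u].X - 1) for u in range(n)]
```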
Appendix D: Variational optimization of depth-1 QAOA
Given a graph \(G_{n} = (V,E)\) with \(|V| = n\) vertices, the depth-1 QAOA prepares the variational state
\( { \vert {\Psi _{1}({\alpha }, {\gamma })} \rangle } = U_{m}(\alpha ) U_{c}(\gamma ) { \vert {+} \rangle }^{\otimes n} \),
where \(U_{c}(\gamma ) = e^{-i\gamma H_{n}}\) is the cost unitary generated by the problem Hamiltonian \(H_{n}\), and \(U_{m}(\alpha ) = e^{-i\alpha H_{b}}\) is the mixer unitary generated by \(H_{b} = \sum_{u \in V} X_{u}\).
Now, we want to compute the maximum expected energy \(\langle H_{n} \rangle _{1} := { \langle{\Psi _{1}({\alpha }, {\gamma })} \vert } H_{n} { \vert {\Psi _{1}({\alpha }, {\gamma })} \rangle }\) at the optimal values of \((\alpha , \gamma )\).
Consider an edge \((u,v) \in E\). We first focus on the contribution of the mixer Hamiltonian \(H_{b}\): using the Pauli operator commutation rules, the inner conjugation expands as \(U_{m}^{\dagger}(\alpha )Z_{u}Z_{v}U_{m}(\alpha ) = \exp (2i\alpha (X_{u} + X_{v}) ) Z_{u}Z_{v}\). Then, using \(X^{2} = I\), this becomes
\( \exp (2i\alpha (X_{u} + X_{v}) ) Z_{u}Z_{v} = \cos ^{2}(2\alpha ) Z_{u}Z_{v} + \sin (2\alpha )\cos (2\alpha ) (Y_{u}Z_{v} + Z_{u}Y_{v}) + \sin ^{2}(2\alpha ) Y_{u}Y_{v} \).
By linearity of expectation, we can write the expectation as
\( \langle H_{n} \rangle _{1} = p \cos ^{2}(2\alpha ) + q \sin (2\alpha )\cos (2\alpha ) + r \sin ^{2}(2\alpha ) \),
where p, q, r are real coefficients that are (unknown, complicated) functions of γ. These coefficients can be computed from the following system of equations, obtained by evaluating \(\langle H_{n} \rangle _{1}\) at three values of α:
\( p = \langle H_{n} \rangle _{1}|_{\alpha = 0} \), \( r = \langle H_{n} \rangle _{1}|_{\alpha = \pi /4} \), \( p + q + r = 2 \langle H_{n} \rangle _{1}|_{\alpha = \pi /8} \).
After the values of p, q, r are known, we can compute the maximum of \(\langle H_{n} \rangle _{1}\) over all α by employing elementary trigonometry,
\( \langle H_{n} \rangle _{1} = \frac{p+r}{2} + \frac{p-r}{2}\cos (4\alpha ) + \frac{q}{2}\sin (4\alpha ) \leq \frac{p+r}{2} + \frac{1}{2}\sqrt{(p-r)^{2} + q^{2}} \),
where the optimal α can be computed by solving
\( \tan (4\alpha ^{*}) = \frac{q}{p-r} \).
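The decomposition \(\langle H_{n} \rangle _{1} = p \cos ^{2}(2\alpha ) + q \sin (2\alpha )\cos (2\alpha ) + r \sin ^{2}(2\alpha )\) can be checked numerically on a small instance. The sketch below (the triangle graph, its weights, and the mixer/cost sign conventions are illustrative choices, not taken from the paper) recovers p, q, r from three evaluations of the energy and verifies the closed form at an arbitrary α:

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])

def kron_all(ops):
    return reduce(np.kron, ops)

# Illustrative 3-vertex instance (triangle); the weights are made up.
n = 3
edges = [(0, 1, 1.0), (1, 2, -0.5), (0, 2, 0.7)]

# H_n = sum_{(u,v)} J_uv Z_u Z_v is diagonal, so store only its diagonal.
z = np.array([1.0, -1.0])
def z_diag(q):
    return kron_all([z if i == q else np.ones(2) for i in range(n)])

Hdiag = sum(w * z_diag(u) * z_diag(v) for u, v, w in edges)
plus = np.ones(2 ** n) / np.sqrt(2 ** n)

def energy(alpha, gamma):
    """<Psi_1(alpha, gamma)| H_n |Psi_1(alpha, gamma)> for depth-1 QAOA."""
    psi = np.exp(-1j * gamma * Hdiag) * plus            # cost unitary U_c(gamma)
    rx = np.cos(alpha) * I2 - 1j * np.sin(alpha) * X    # single-qubit e^{-i alpha X}
    psi = kron_all([rx] * n) @ psi                      # mixer unitary U_m(alpha)
    return float(np.real(np.vdot(psi, Hdiag * psi)))

gamma = 0.4
p = energy(0.0, gamma)                   # alpha = 0    isolates p
r = energy(np.pi / 4, gamma)             # alpha = pi/4 isolates r
q = 2.0 * energy(np.pi / 8, gamma) - p - r

# The closed form reproduces the energy at an arbitrary alpha.
alpha = 0.3
closed_form = (p * np.cos(2 * alpha) ** 2
               + q * np.sin(2 * alpha) * np.cos(2 * alpha)
               + r * np.sin(2 * alpha) ** 2)
```

The same three-point fit also reproduces the analytic maximum \(\frac{p+r}{2} + \frac{1}{2}\sqrt{(p-r)^{2} + q^{2}}\) when the energy is scanned over α.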
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Patel, Y.J., Jerbi, S., Bäck, T. et al. Reinforcement learning assisted recursive QAOA. EPJ Quantum Technol. 11, 6 (2024). https://doi.org/10.1140/epjqt/s40507-023-00214-w