Reinforcement learning assisted recursive QAOA

In recent years, variational quantum algorithms such as the Quantum Approximate Optimization Algorithm (QAOA) have gained popularity as they provide the hope of using NISQ devices to tackle hard combinatorial optimization problems. It is, however, known that at low depth, certain locality constraints of QAOA limit its performance. To go beyond these limitations, a non-local variant of QAOA, namely recursive QAOA (RQAOA), was proposed to improve the quality of approximate solutions. RQAOA has been studied comparatively less than QAOA, and it is less well understood, for instance, for which families of instances it may fail to provide high-quality solutions. However, as we are tackling NP-hard problems (specifically, the Ising spin model), it is expected that RQAOA does fail, raising the question of designing even better quantum algorithms for combinatorial optimization. In this spirit, we identify and analyze cases where (depth-1) RQAOA fails and, based on this, propose a reinforcement learning enhanced RQAOA variant (RL-RQAOA) that improves upon RQAOA. We show that the performance of RL-RQAOA improves over RQAOA: RL-RQAOA is strictly better on these identified instances where RQAOA underperforms and performs similarly on instances where RQAOA is near-optimal. Our work exemplifies the potentially beneficial synergy between reinforcement learning and quantum (inspired) optimization in the design of new, even better heuristics for complex problems.


I. INTRODUCTION
As quantum computing is becoming practical [11,16,20,33], there has been a growing interest in employing near-term quantum algorithms to help solve problems in quantum chemistry [30], quantum machine learning [2], and combinatorial optimization [14]. Any such near-term algorithm must consider the primary restrictions of Noisy Intermediate-Scale Quantum (NISQ) devices, e.g., the number of qubits, decoherence, etc. Variational Quantum Algorithms (VQAs) such as the Quantum Approximate Optimization Algorithm (QAOA) [14] were developed as a potential approach to achieve a quantum advantage in practical applications while keeping these design restrictions in mind.
For a user-specified input depth $l$, QAOA consists of a quantum circuit with $2l$ variational parameters. In the limit of infinite depth, for optimal parameters, the solution of QAOA converges to the optimum for a given combinatorial optimization problem [14]. However, a significant body of research has produced negative results [1,7,9,12,13,18,27,28] for QAOA limited to logarithmic depth (in the number of qubits), exploiting the notion of locality or symmetry in QAOA. This motivates the study of techniques that circumvent the restriction of locality or symmetry in QAOA and that exploit the information-processing capabilities of low-depth quantum circuits by employing classical non-local pre- and post-processing steps.¹
One such proposal is the recursive QAOA (RQAOA), a non-local variant of QAOA, which uses shallow-depth QAOA circuits iteratively; at every iteration, the size of the problem (usually expressed in terms of a graph or a hypergraph) is reduced by one (or more). The elimination procedure introduces non-local effects via the new connections between previously unconnected nodes, which counteracts the locality restrictions of QAOA. The authors in [6-8] empirically show that depth-1 RQAOA always performs better than depth-1 QAOA and is competitive with the best known classical algorithms based on rounding of a semidefinite programming relaxation for Ising and graph colouring problems. However, given that these problems are NP-hard, there must also exist instances that RQAOA fails to solve exactly, unless $\mathrm{NP} \subseteq \mathrm{BQP}$. Hence, to further push the boundaries of algorithms for combinatorial optimization on NISQ devices (and beyond), it is helpful to determine when RQAOA fails, as this can aid in developing better variants of RQAOA.

* These two authors contributed equally.
¹ The time complexity of these auxiliary steps should be polynomial in input size for the algorithm to remain practically viable.
In this work, we study extensions of RQAOA that perform better than RQAOA for the Ising problem (or equivalently, the weighted Max-Cut problem, where the external field is zero; refer to Sec. II B). We do this by identifying cases where RQAOA fails (i.e., we find small-scale instances with approximation ratio $\le 0.95$). Then, we analyze the reasons for this failure and, based on these insights, we modify RQAOA. We employ reinforcement learning (RL) not only to tweak RQAOA's selection rule, but also to train the parameters of QAOA instead of using energy-optimal ones, in a new algorithm that we call RL-RQAOA. In particular, the proposed hybrid algorithm provides a suitable test-bed for assessing the potential benefit of RL: we perform simulations of (depth-1) RQAOA and RL-RQAOA on an ensemble of randomly generated weighted $d$-regular graphs and show that RL-RQAOA consistently outperforms its counterparts. In the proposed algorithm, the RL component itself plays an integral role in finding the solution, which raises the question of the actual role of the QAOA circuit and thus of potential quantum advantages. To show that the QAOA circuits make a non-trivial contribution to the advantage, we compare RL-RQAOA to an entirely classical RL agent (which, given exponential time, imitates a brute-force algorithm) and show that RL-RQAOA converges both faster and to better solutions than the simple classical RL agents. We note that our approach to enhancing RQAOA's performance is not limited to depth 1 and can be straightforwardly extended to higher depths.
We present our results as follows: Sec. II introduces QAOA, recursive QAOA (RQAOA), and fundamental concepts behind policy gradient methods in RL. Sec. III presents related works. Sec. IV describes the limitations of RQAOA, and we illustrate their validity by performing numerical simulations. In Sec. V, we provide a sketch of the policies of RL-RQAOA (quantum-classical) and RL-RONE (classical, introduced to characterize the role of quantum aspects of the algorithm) and their learning algorithms. Sec. VI presents our computational results for the comparison between classical and hybrid algorithms (RQAOA, RL-RQAOA, and RL-RONE) on an ensemble of Ising instances. Finally, we conclude with a discussion in Sec. VII.

II. BACKGROUND
In this section, we first provide a brief overview of QAOA (Sec. II A) and its classical simulatability for the Ising problem (Sec. II B). Later, we introduce recursive QAOA (RQAOA) (Sec. II C), upon which we base our proposal for RL-enhanced RQAOA, and introductory concepts behind policy gradient in RL (Sec. II D). These notions will give us tools to develop policies based on the QAOA ansatz and their learning algorithms in the upcoming sections.
Depth-$l$ QAOA prepares the variational state $|\Psi_l(\vec{\alpha}, \vec{\gamma})\rangle$ of Eq. (1), where the variational parameters $\{\vec{\alpha}, \vec{\gamma}\} \in [0, 2\pi]^{2l}$ and the integer $l$ is called the QAOA depth. The depth $l$ controls the non-locality of the QAOA circuit. During the operation of QAOA, these parameters are tuned to optimize the expected value of the cost function, $\langle \Psi_l(\vec{\alpha}, \vec{\gamma}) | C | \Psi_l(\vec{\alpha}, \vec{\gamma}) \rangle$. The preparation of the state (1) is followed by a measurement in the computational basis, which outputs a bitstring $x$ corresponding to a candidate solution of the cost function $C$. The probability $P_l(x)$ of obtaining a bitstring $x \in \{0, 1\}^n$ is given by Born's rule,

$$P_l(x) = |\langle x | \Psi_l(\vec{\alpha}, \vec{\gamma}) \rangle|^2. \tag{2}$$

A candidate bitstring $x^*$ is called an $r$-approximation solution to a given instance, for $0 \le r \le 1$, if

$$C(x^*) \ge r \cdot \max_x C(x). \tag{3}$$

An algorithm is said to achieve an approximation ratio of $r$ for a cost function $C$ if it returns an $r$-approximation or better for every problem instance in the class (i.e., in the worst case). We say that depth-$l$ QAOA achieves an approximation ratio of $r$ for a problem instance of a cost function $C$ if there exist parameters $\{\vec{\alpha}, \vec{\gamma}\}$ such that

$$\langle \Psi_l(\vec{\alpha}, \vec{\gamma}) | C | \Psi_l(\vec{\alpha}, \vec{\gamma}) \rangle \ge r \cdot \max_x C(x). \tag{4}$$

We note that repeating a sequence of state preparations and measurements approximates the distribution of $x$ given by (2) and that the left-hand side of (4) is the mean of $C(x)$ under this distribution. The candidate bitstring $x^*$ may then be selected to yield the maximum approximation ratio $r$.
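As a concrete illustration of the approximation-ratio bookkeeping described above, the following minimal Python sketch evaluates the ratio $C(x) / \max_y C(y)$ for a few candidate bitstrings. The unweighted Max-Cut cost, the 4-cycle instance, the hand-picked samples, and all function names are illustrative assumptions rather than anything taken from the paper; the maximum is found by brute force, which is feasible only for very small $n$ and assumes the maximum is positive.

```python
from itertools import product

def cut_value(x, edges):
    """Toy cost C(x): number of edges whose endpoints receive different bits."""
    return sum(1 for u, v in edges if x[u] != x[v])

def approximation_ratio(x, edges, n):
    """Return C(x) / max_y C(y); the maximum is found by exhaustive search."""
    best = max(cut_value(y, edges) for y in product((0, 1), repeat=n))
    return cut_value(x, edges) / best

# Hypothetical toy instance: a 4-cycle, whose maximum cut has value 4.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

# Stand-ins for measured bitstrings (in practice these would be samples
# obtained by repeated state preparation and measurement).
samples = [(0, 1, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1)]

ratios = [approximation_ratio(x, edges, 4) for x in samples]
mean_ratio = sum(ratios) / len(ratios)  # empirical mean over the samples
best_ratio = max(ratios)                # ratio of the best sampled bitstring
```

The distinction between `mean_ratio` and `best_ratio` mirrors the text: the expectation value is the mean over the measured distribution, while the reported candidate is the best bitstring among the samples.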

B. Classical Simulatability of QAOA for the Ising problem
Next, we review the classical simulatability of a paradigmatic case of QAOA for the Ising problem. This is a core building block for simulating both (depth-1) RQAOA and RL-RQAOA, as it enables their efficient classical simulation at depth 1 for arbitrary graphs. Given a graph $G_n = (V, E)$ with $n$ vertices $V = [n]$ (where $[n] = \{1, 2, \ldots, n\}$) and edges $E \subset V \times V$, as well as an external field $h_u \in \mathbb{R}$ and a coupling coefficient (edge weight) $J_{uv} \in \mathbb{R}$ associated with each vertex and edge, respectively, the Ising problem aims to find a spin configuration $s \in \{-1, +1\}^n$ maximizing the cost Hamiltonian,²

$$H_n = \sum_{u \in V} h_u Z_u + \sum_{(u,v) \in E} J_{uv} Z_u Z_v. \tag{5}$$

The Ising problem without any external field is equivalent to the weighted Max-Cut problem, where the goal is to find a bi-partition of vertices such that the total weight of the edges between them is maximized. The expected value of each Pauli operator $Z_u$ and $Z_u Z_v$ on depth-1 QAOA can be computed classically in $O(n)$ time using analytical results stated in Theorem 1 in Appendix A.
Since the cost function has $O(n^2)$ terms in the worst case, computing the final expected value of (5) takes a total time of $O(n^3)$ given the variational parameters.

[Figure: QPU/CPU schematic] Training QAOA-based policies for reinforcement learning. We consider an RL-enhanced recursive QAOA (RL-RQAOA) scenario where a hybrid quantum-classical agent learns by interacting with an environment, which we represent as a search tree induced by the recursive framework of RQAOA. The agent samples the next action $a$ (corresponding to selecting an edge and its sign) from its policy $\pi_\theta(a|s)$ and receives feedback in the form of a reward $r$, where each state corresponds to a graph (the state space is characterized by a search tree of weighted graphs, where each node of the tree corresponds to a graph). The nodes at each level of the search tree correspond to the candidate states that an agent can reach by taking an action. For our hybrid agents, the policy $\pi_\theta$ of RL-RQAOA (see Def. 1), along with the gradient estimate $\nabla_\theta \log \pi_\theta$, is evaluated on a CPU, as we are in the regime where depth $l = 1$. However, the policy can also be evaluated on a quantum processing unit (QPU) for higher depths, where classical simulations can only be performed efficiently for graphs of small size. The training of the policy is performed by a classical algorithm such as REINFORCE (see Alg. 1), which uses sample interactions and policy gradients to update parameters.
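For intuition about the objective of Sec. II B, the classical Ising cost can be evaluated directly for a given spin configuration. The sketch below assumes the cost has the form "external-field term plus pairwise coupling term" described in the text; the dictionary-based data layout, the function names, and the frustrated-triangle instance are hypothetical choices for illustration. It does not implement the analytical depth-1 QAOA expectation values of Theorem 1, only the classical energy and an exhaustive maximization for small $n$.

```python
from itertools import product

def ising_energy(s, h, J):
    """Energy sum_u h[u]*s[u] + sum_{(u,v)} J[(u,v)]*s[u]*s[v] for spins s in {-1,+1}^n."""
    field = sum(h[u] * s[u] for u in range(len(s)))
    coupling = sum(w * s[u] * s[v] for (u, v), w in J.items())
    return field + coupling

def brute_force_max(n, h, J):
    """Exhaustively maximize the energy over all 2^n spin configurations."""
    best = max(product((-1, 1), repeat=n), key=lambda s: ising_energy(s, h, J))
    return best, ising_energy(best, h, J)

# Hypothetical instance: a triangle with all couplings -1 and no external field.
h = [0.0, 0.0, 0.0]
J = {(0, 1): -1.0, (1, 2): -1.0, (0, 2): -1.0}

best_spins, best_energy = brute_force_max(3, h, J)
```

Each energy evaluation touches every coupling once, so for a dense graph a single evaluation costs $O(n^2)$; the exhaustive search over $2^n$ configurations is what makes brute force impractical beyond small instances.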
C. Recursive QAOA
In this subsection, we outline the RQAOA algorithm of Bravyi et al. [7] for the Ising problem as defined in (5) with no external fields ($h_u = 0$, $\forall u \in V$). This will serve as a base for our proposal of RL-enhanced RQAOA. The RQAOA algorithm aims to approximate the maximum expected value³ $\max_x \langle x | H_n | x \rangle$, where $x \in \{0, 1\}^n$. It consists of the following steps. First, a standard depth-$l$ QAOA is executed to find the quantum state $|\Psi^*_l(\vec{\alpha}, \vec{\gamma})\rangle$ (with optimal variational parameters) as in (1) that maximizes the expectation value of $H_n$. For each edge $(u, v) \in E$, the two-correlation $M_{u,v} = \langle \Psi^*_l(\vec{\alpha}, \vec{\gamma}) | Z_u Z_v | \Psi^*_l(\vec{\alpha}, \vec{\gamma}) \rangle$ is computed. A variable $Z_u$ with the largest $|M_{u,v}|$ is then eliminated (breaking ties arbitrarily) by imposing the constraint

$$Z_u = \mathrm{sign}(M_{u,v}) \, Z_v, \tag{6}$$

which yields a new Ising Hamiltonian $H_{n-1}$ with at most $n - 1$ variables. The resulting Hamiltonian is processed iteratively, following the same steps. Finally, this iterative process stops once the number of variables is below a predefined threshold $n_c$. The remaining Hamiltonian with $n_c$ variables can then be solved using a classical algorithm (e.g., a brute-force method). The final solution can then be obtained iteratively by reconstructing eliminated variables using (6). We note that the variable elimination scheme in RQAOA is analogous to rounding solutions obtained by solving continuous relaxations of combinatorial optimization problems. We refer the interested reader to [29, Sec. V.A] for a detailed discussion on the connection between quantum optimization algorithms and classical approximation algorithms. Recall that the final expected value of $H_n$ as in (5) can be computed in $O(n^3)$ time. Since we can choose $n_c = O(1)$, RQAOA runs for approximately $n$ iterations, so that the total running time is $O(n^4)$ (neglecting the running time needed to select the variational parameters).
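The elimination step can be made concrete with a small Python sketch. It substitutes a constraint of the form $Z_u = \sigma Z_v$ with $\sigma \in \{-1, +1\}$ into a field-free Ising cost and returns the reduced couplings together with the constant that the eliminated edge contributes. The dictionary representation (keys are sorted vertex pairs), the function names, and the example weights are assumptions made for illustration; how $\sigma$ is obtained from the two-correlations is left outside the sketch.

```python
def eliminate_variable(J, u, v, sigma):
    """Substitute s_u = sigma * s_v into sum_{(a,b)} J[(a,b)] * s_a * s_b.

    J maps sorted vertex pairs (a, b), a < b, to weights.
    Returns (reduced couplings without vertex u, constant offset).
    """
    new_J, offset = {}, 0.0
    for (a, b), w in J.items():
        if u not in (a, b):
            key, weight = (a, b), w              # edge untouched by the substitution
        else:
            other = b if a == u else a
            if other == v:
                offset += sigma * w              # s_u * s_v = sigma becomes a constant
                continue
            key, weight = tuple(sorted((v, other))), sigma * w
        new_J[key] = new_J.get(key, 0.0) + weight  # merge parallel edges
    return new_J, offset

def reconstruct(spins, eliminated):
    """Recover eliminated spins in reverse order using s_u = sigma * s_v."""
    spins = dict(spins)
    for u, v, sigma in reversed(eliminated):
        spins[u] = sigma * spins[v]
    return spins

# Hypothetical 3-vertex instance; eliminate vertex 0 as anti-correlated with vertex 1.
J = {(0, 1): 2.0, (0, 2): -1.0, (1, 2): 0.5}
reduced, offset = eliminate_variable(J, u=0, v=1, sigma=-1)
full = reconstruct({1: 1, 2: 1}, [(0, 1, -1)])
```

Merging weights on the key `(v, other)` is where the new, possibly non-local connections arise: a neighbour of the eliminated vertex becomes a neighbour of `v` even if the two were not connected before.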

D. Reinforcement Learning Primer
As our proposal to improve upon RQAOA is based on reinforcement learning, we introduce basic concepts behind RL and the policy gradient method in this subsection.
In RL, the agent learns an optimal policy by interacting with its environment using a trial-and-error approach [36]. Formally, RL can be modeled as a Markov Decision Process (MDP) defined by the tuple $(S, A, p, R)$, where $S$ and $A$ represent the state and action spaces (each can be continuous or discrete), the function $p : S \times S \times A \to [0, 1]$ defines the transition dynamics, and $R : S \times A \to \mathbb{R}$ describes the reward function of the environment. An agent's behaviour is governed by a stochastic policy $\pi_\theta(a|s) : S \times A \to [0, 1]$, for $a \in A$ and $s \in S$. Highly expressive function approximators, such as deep neural networks (DNNs), can be used to parametrize a policy $\pi_\theta$ using tunable parameters $\theta \in \mathbb{R}^d$. An agent's interaction with the environment, governed by a policy $\pi_\theta(a|s)$, can be viewed as sampling a trajectory $\tau \sim P_E(\cdot)$ from the MDP, where

$$P_E(\tau) = p_0(s_0) \prod_{t=0}^{H-1} \pi_\theta(a_t | s_t) \, p(s_{t+1} | s_t, a_t)$$

is the probability of the trajectory $\tau$ of length $H$ occurring, and $p_0$ is the distribution of the initial state $s_0$. An example of a trajectory is $\tau = (s_0, a_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1}, s_H)$. An agent collects a sequence of rewards based on its interactions with the environment. The metric that assesses an agent's performance is called the value function $V_{\pi_\theta}$ and takes the form of a discounted sum,

$$V_{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta, P_E}\left[ \sum_{t=0}^{H-1} \gamma^t r_t \right],$$

where $s_0$ is the initial state of an agent's trajectory $\tau$ within an environment, $P_E$ describes the environment dynamics (i.e., in the form of an MDP), and $r_t$ is the reward at time step $t$ during the interaction. Every trajectory has a horizon (length) $H \in \mathbb{N} \cup \{\infty\}$, and the expected return involves a discounting factor $\gamma \in [0, 1]$. Most often one chooses $\gamma < 1$ to avoid diverging value functions for a horizon $H = \infty$. Finally, the goal of an RL algorithm is to learn an optimal policy $\pi^*_\theta$ such that the value function is maximized for each state. One way of finding a good policy is through the policy gradient method, i.e., finding an optimal set of parameters $\theta$ which maximize the value function of the policy (by evaluating its gradient). For the sake of brevity, we defer the explanation of the policy gradient method to Appendix B.
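The discounted sum inside the value function is simple to compute for a recorded trajectory. The Python sketch below is a generic illustration with hypothetical function names and made-up reward sequences: it accumulates $\sum_t \gamma^t r_t$ backwards and estimates the value of an initial state by averaging over sampled trajectories.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def monte_carlo_value(reward_sequences, gamma):
    """Estimate the value of the initial state as the mean discounted return."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

# Made-up reward sequences from two trajectories; the second has a delayed reward.
dense = [1.0, 1.0, 1.0]
delayed = [0.0, 0.0, 4.0]

g_dense = discounted_return(dense, 0.5)      # 1 + 0.5 + 0.25
g_delayed = discounted_return(delayed, 0.5)  # 0.25 * 4
value_estimate = monte_carlo_value([dense, delayed], 0.5)
```

With a delayed reward, as in the second sequence, the discount factor only rescales the single terminal reward, so for a finite horizon one may also take $\gamma = 1$ without any divergence.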

III. RELATED WORK
In the context of RL, two works [35,41] developed optimizers based on policy gradient methods for VQA optimization, highlighting the robustness of RL-based techniques against off-the-shelf optimizers in the presence of noise.As opposed to our work, both these works use an external RL policy to choose the angles of QAOA in a one-step Markov Decision Process (MDP) environment, and otherwise rely on the basic QAOA algorithm.A series of works [43,44] have also used RL-based optimization to generalize the approach of QAOA for preparing the ground state of quantum many-body problems.In [43], an agent uses an auto-regression mechanism to sample the gate unitaries in a one-step MDP and employs an off-the-shelf optimizer to optimize angles to prepare a generalized QAOA ansatz.The same set of authors then unify their previous works [41,43] with both the use of a generalized autoregressive architecture that incorporates the parameters of the continuous policy and an extended variant of Proximal Policy Optimization (PPO) applicable to hybrid continuous-discrete policies [42].We note that for all the works [41][42][43][44], the quantum circuit (QAOA-type ansatz) is a part of an environment.In our case, we focus on employing reinforcement learning to enhance the performance of the RQAOA, inspired by a recent work [19] on using quantum circuits to design RL policies.In contrast to the approaches discussed above, we design an RL policy based on QAOA ansatz in a multi-step MDP environment where the quantum circuit (QAOA ansatz) is not a part of the environment.Other works have used Q-learning to formulate QAOA into an RL framework to solve difficult combinatorial problems [38] and in the context of digital quantum simulation [22].
In the context of employing non-local post-processing methods in quantum optimization algorithms akin to classical iterated rounding, there have been a few proposals to modify RQAOA. The main idea behind RQAOA is to use QAOA iteratively to compute correlations and then, at every iteration, employ a rounding (variable elimination) procedure to reduce the size of the problem by one. The variants of RQAOA proposed in the literature primarily differ in how the correlations are computed and how the variables are eliminated. For instance, in [7,8], the variable elimination scheme of RQAOA is deterministic and relies on correlations between qubits (qudits). On the other hand, the authors in [29, Sec. V.A] propose a modified RQAOA where the rounding procedure is stochastic (controlled by a fixed hyperparameter $\beta$), and a variable is eliminated based on individual spin polarizations. In contrast, our proposal of RL-RQAOA trains analogous parameter(s) $\vec{\beta}$ via RL (see Appendix C) and uses correlations between qubits to perform variable elimination.
Note added: Several preprints on iterative/recursive quantum optimization algorithms generalizing RQAOA have appeared since the submission of this work on arXiv. Parallel works such as [4,10,15] widen the selection and variable elimination schemes within the framework of recursive quantum optimization, in application to constrained problems such as Maximum Independent Set (MIS) and Max-2-SAT. Moreover, [4] gives theoretical justifications of why depth-1 QAOA might not be a suitable candidate for quantum advantage and consequently urges the community to explore higher-depth alternatives.

IV. LIMITATIONS OF RQAOA
This section highlights some algorithmic limitations of RQAOA by introducing an alternative perspective on it. Then, based on this perspective, we provide insights into when RQAOA might fail and why. It is obvious that (depth-1) RQAOA must fail on some instances, since we assume $\mathrm{BPP} \subsetneq \mathrm{NP}$,⁴ but these instances may be quite big a priori. By "failure", we mean that RQAOA cannot find an optimal (exact) solution. Notably, even if depth-$l$ RQAOA fails to find exact solutions, it could still achieve an approximation ratio better than the bound known from inapproximability theory. In that case, $\mathrm{NP} \subseteq \mathrm{BQP}$ would still hold. For instance, if RQAOA fails to find an exact solution and still achieves an approximation ratio of $16/17 + \varepsilon$ or $0.8785 + \varepsilon$, then $\mathrm{NP} \subseteq \mathrm{BQP}$ follows from [17,23] under different complexity-theoretic assumptions, thus demonstrating quantum advantage. We primarily focus on finding small-size instances, since we need a data set of small instances to be able to compare the performance of (depth-1) RQAOA and RL-RQAOA in a computationally efficient way.
First, let us motivate the use of QAOA as a subroutine in RQAOA. In other words, why would one optimize depth-$l$ QAOA (i.e., find energy-optimal parameters for a Hamiltonian) and then use it in a completely different way (i.e., perform variable elimination by computing two-correlation coefficients $M_{u,v}$)? Intuitively, using QAOA in such a fashion makes sense because as depth $l \to \infty$, the output of QAOA converges to the quantum state which is the uniform superposition over all optimal solutions, and hence, for each pair $(u, v) \in E$, computing the coefficient $M_{u,v}$ exactly predicts whether the edge is correlated (vertices with the same sign, i.e., lying in the same partition) or anti-correlated (vertices with different signs, i.e., lying in different partitions) in an optimal cut. The next piece of intuition, which is not a formal argument, is that low-depth QAOA prepares a superposition state where low-energy states are more likely to have high probability amplitudes. RQAOA then selects the edge which is most correlated or anti-correlated in these low-energy states. Furthermore, assuming that an ensemble of reasonable solutions often agree on which edges to keep and which ones to cut, RQAOA will select good edges to cut or keep (from the Max-Cut perspective). However, we also expect RQAOA to fail sometimes, for instance, when the intuition mentioned above is wrong, or when it assigns a wrong edge-correlation sign to an edge for other reasons. Hence, as RQAOA fails, this raises the question of whether there are better angles for selecting an edge and its correct edge-correlation sign at every iteration than those which coincide with energy-optimal angles (see Fig. 2).
RQAOA can alternatively be visualized as performing a tree search to find the most probable spin configuration close to the ground state of the Ising problem. In particular, at the $k$-th level of the tree, nodes correspond to graphs with $n - k$ vertices, each having different edge sets. If a node has $n - k$ vertices and $e$ edges, then it has $e$ children, where each child corresponds to a graph with $n - k - 1$ vertices and a different edge set, obtained by following the edge contraction rules that result from imposing (6). The original RQAOA proposal [7] is a randomized algorithm (in the sense that ties between maximal two-correlation coefficients are broken uniformly at random) on this tree, exploring only a single path during one run and terminating at the $(n - n_c)$-th level. The branch is chosen according to the largest absolute value of the two-correlation coefficients $M_{u,v}$, computed via a depth-$l$ QAOA using $H_n$-energy-optimal parameters. While exploring level by level, RQAOA assigns the edge correlations ($-1$ or $+1$), and a vertex is eliminated according to the constraint (6). We note that in the case of ties between maximal two-correlation coefficients, independent runs of RQAOA might not necessarily induce the same search tree.
This alternative perspective provides some insights regarding the limitations of RQAOA: (i) when there are ties and branching occurs, it could be that only one path within a set of induced search trees leads to a good approximate solution; and (ii) it may be the case that even when there are no ties (i.e., one path and no branching), selecting edges to contract according to the maximal correlation coefficient stemming from energy-optimal parameters of QAOA is an incorrect choice to attain a good solution. A priori, it is not obvious whether either of these two possibilities can occur under the choice of energy-optimal angles. However, note that one of (i), (ii), or a combination of both must happen; otherwise, RQAOA is an efficient polynomial-time algorithm for the Ising problem. Hence, in the case that RQAOA makes an incorrect choice, RQAOA lacks the ability to explore the search tree to find better approximate solutions. Keeping these considerations in mind, we will show later that both phenomena (i) and (ii) occur by performing an empirical analysis of RQAOA. We now describe both limitations in detail.

[Figure 2 panels: (a) weighted graph; (b) two-correlation coefficients per edge of the graph; (c), (d) sets of QAOA angles. Legend: energy-optimal QAOA angles; correct choice for edge (0, 2); mistake for edge (0, 2); results for other edges.]
FIG. 2. Illustration of a counterexample where the heuristic of using the energy-optimal QAOA angles in RQAOA fails. Here, we show that for the weighted graph (9 vertices and 24 edges) depicted in (a), RQAOA makes a mistake even in its strongest regime, that is, at the very first iteration (i.e., $n_c = 8$). The two-correlation coefficients for each edge (at energy-optimal angles) are shown in the form of a horizontal bar plot in (b), where the edge (0, 2) has the maximal correlation coefficient. For the graph in (a), RQAOA with energy-optimal angles assigns a wrong edge-correlation (sign) to this edge, which is highlighted by a bold star in (c) and (d). Both (c) and (d) characterize the sets of good and bad QAOA angles where RQAOA makes a correct and a wrong choice, respectively. This example is counter-intuitive: as the edge (0, 2) has the highest weight in the graph, intuitively, the variables should be correlated (same sign) so as to maximize the energy. However, this leads to a sub-optimal solution, which RQAOA achieves with energy-optimal angles. Yet, for different settings of QAOA angles which do not maximize the overall energy, this edge will still have the largest magnitude of correlation, but in this case as an anti-correlation, which leads to the true optimum (see sub-figure (c)).
(i) It may be the case that eliminating a variable by taking the argmax of the absolute value of two-correlation coefficients is always a correct choice, but there can be more than one choice at every iteration. Moreover, it is possible to construct instances with a small number of optimal solutions, where for the majority of the $n - n_c$ iterations (corresponding to the levels of the tree) there is at least one tie (here, $m$ ties correspond to $m + 1$ pairs $(u_1, v_1), \ldots, (u_{m+1}, v_{m+1})$ with the same two-correlation coefficient). In other words, the number of times RQAOA needs to traverse the search tree in the worst case to reach the ground state (optimum) may be exponentially large; i.e., every argmax tie break leads to a new branching of the potential choices of RQAOA, and this happens at each level of the tree. We showcase this phenomenon in our empirical analysis for one such family of instances (see Fig. 3). One may imagine perturbing the edge weights to avoid ties while preserving the ground states of the Hamiltonian, but no such perturbation is generally known.
(ii) It may be the case that the path to reach the ground state requires the selection of a pair $(u, v)$ (and its correlation sign) for which the two-correlation coefficient is not maximal according to QAOA at energy-optimal parameters (see Fig. 2). This implies that RQAOA might prematurely lock itself out of optimal solutions.
We provide examples of graphs to support the observations above. In the regime where there are ties between maximal correlation coefficients [(i)], we performed 200 independent RQAOA runs for the family of weighted $(d, g)$-cage graphs⁵ ($3 \le d \le 7$; $5 \le g \le 12$; edge weights in $\{-1, +1\}$), where ties are broken uniformly at random for the $n - n_c$ iterations (levels of the tree). We work with these graphs because the subgraphs that (depth-1) QAOA sees are regular trees (for most edges at every iteration of RQAOA, QAOA will see a $(d - 1)$-ary tree, as cage graphs are $d$-regular graphs, which creates the situation of ties between correlation coefficients). Here, by "seeing" we refer to the fact that the output of depth-$l$ QAOA for a qubit (vertex) only depends on the neighbourhood of qubits that are within distance $l$ of the given qubit [13]. For these graphs, we found that in $86.4 \pm 9.63\%$ of the $n - n_c$ iterations, the variable to eliminate was chosen from the ties between maximal correlation coefficients (see Fig. 3).
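The tie statistic discussed above can be illustrated with a short Python sketch. It collects all edges whose $|M_{u,v}|$ attains the maximum (up to a numerical tolerance) and reports the fraction of iterations in which more than one edge is maximal. The tolerance value, the function names, and the two made-up sets of two-correlations are assumptions for illustration; they are not the procedure or data used for the reported numbers.

```python
def maximal_ties(M, tol=1e-9):
    """Return all edges whose |M| lies within tol of the largest |M|."""
    top = max(abs(m) for m in M.values())
    return sorted(e for e, m in M.items() if top - abs(m) <= tol)

def tie_fraction(per_iteration_M, tol=1e-9):
    """Fraction of iterations in which more than one edge attains the maximum."""
    tied = sum(1 for M in per_iteration_M if len(maximal_ties(M, tol)) > 1)
    return tied / len(per_iteration_M)

# Made-up two-correlations for two successive iterations.
iteration_1 = {(0, 1): 0.5, (1, 2): -0.5, (0, 2): 0.1}  # two edges tie at |M| = 0.5
iteration_2 = {(0, 1): 0.3, (1, 2): 0.2}                # unique maximum

ties_1 = maximal_ties(iteration_1)
fraction = tie_fraction([iteration_1, iteration_2])
```

Comparing absolute values with a tolerance rather than exact equality is a design choice for floating-point data; an exact comparison would miss ties that differ only by rounding error.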
To investigate the scenario of [(ii)], we focus on a particular case where there are no ties (or comparatively fewer ties) and find instances such that taking the maximal two-correlation coefficient does not reach the optimum solution in the tree. For this, we performed a random search over an ensemble of 10600 weighted random $d$-regular graphs and found several small-size instances (at most 30 nodes) for which RQAOA did not attain the optimum.
Using both the theoretical and numerical observations discussed above, we create a dataset of graphs (containing both hard and random instances for RQAOA) for our later analysis. In the next section, we develop our new algorithm (RL-RQAOA) and compare its performance to RQAOA to assess the benefit of employing reinforcement learning in the context of recursive quantum optimization, specifically for hard instances. Finally, we give the relevant details about the data set of the graph ensemble considered in Sec. VI A.

V. REINFORCEMENT LEARNING ENHANCED RQAOA & CLASSICAL BRUTE FORCE POLICY
Having introduced the background of policy gradient methods and the limitations of RQAOA, we develop a QAOA-inspired policy which selects a branch in the search tree (i.e., eliminates a variable) at every iteration of RL-RQAOA. Recall that, even though selecting an edge to contract according to the maximal two-correlation coefficient is often a good choice, it is not always an optimal one, and often there is not a single best option but several (for instance, see Fig. 2). Our basic idea is to train an RL method to learn how to correctly select the edges to contract (along with their edge-correlation signs) while using the information generated by QAOA. Additionally, to investigate the power of the quantum circuit within the quantum-classical arrangement of RL-RQAOA, we design a classical analogue of RL-RQAOA called reinforcement-learning recursive ONE (RL-RONE) and compare it with RL-RQAOA.
To overcome the limitations of RQAOA, one needs to carefully tweak (a) RQAOA's variable elimination subroutine and (b) the use of QAOA as a subroutine; i.e., instead of finding energy-optimal parameters, we learn the parameters of QAOA. For (a), we apply the non-linear activation function $\mathrm{softmax}_{\vec{\beta}}$ (see Def. 1) to the absolute value of two-correlation coefficients $|M_{u,v}|$ measured on $|\Psi_l(\vec{\alpha}, \vec{\gamma})\rangle$. By doing this, the process of selecting a variable to eliminate (and its sign) is represented by a smooth approximation of argmax that is controlled by a vector of trainable inverse temperature parameters $\vec{\beta}$ (one $\beta$ per edge). The parameters $\vec{\beta}$ (initialized at low values) are then trained such that the probability of selecting an edge (or a branch at every iteration) with the highest expected reward tends to 1. In the case of (b), we train the variational angles of QAOA in the course of learning rather than using the ones that give optimal energy. We do this for the following two reasons: (i) to avoid costly optimization;⁶ (ii) different angle choices can sometimes help the algorithm to choose optimal paths in the search tree that are not possible otherwise (see Fig. 2). We note that the entire learning happens on one instance of the Ising problem. Even though it is conceivable to train the algorithm over an ensemble of instances by introducing suitable generalization mechanisms such that $\vec{\beta}$ are dependent on instances, we solely focus on learning parameters of the policy of RL-RQAOA on one instance so that it eventually performs better than RQAOA.

[Algorithm 1 (REINFORCE), partially recovered: (1) initialize the policy parameters $\theta$; (2) while True do: ... compute the gradients $\nabla_\theta \log \pi_\theta(a|s)$ ...]

To provide further details on the effective Markov Decision Process (MDP) that the policy described above will be exploring, note that the RQAOA method can be interpreted as a multi-step (also called an $n$-step) MDP environment (with a delayed reward and a non-trainable policy), where at every iteration, a variable is eliminated based on the information generated by QAOA. Let us now cast the learning problem of variable elimination in the RL framework, inspired by recent work [19] on using quantum circuits to design RL policies. For every step of the episode,⁷ our RL agent is required to choose one action out of the discrete space equivalent to the edge set of the underlying graph; i.e., in the worst case, it selects one edge out of $\binom{n}{2}$, on which it imposes a constraint of the form (6). Hence, the state space $S$ consists of weighted graphs (which we could encounter during an RQAOA run) and the action space $A$ consists of edges (and $\pm 1$ edge-correlations to impose on them). The actions are selected using a parameterized policy $\pi_\theta(a|s)$ which is based on the QAOA ansatz. Since we use the expectation value of the Hamiltonian $H_n$ of the Ising problem as an objective function, the rewards are real-valued. Next, we formally define the policy of RL-RQAOA and its learning algorithm, which is a crucial part of RL-RQAOA.
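To make the multi-step MDP with delayed reward more tangible, the Python sketch below rolls out a single episode on a small field-free Ising instance. States are coupling dictionaries, actions are (edge, sign) pairs, each action contracts one variable, and the reward is only computed at the end by brute force over the remaining variables. The uniformly random action choice is merely a placeholder for a trained policy $\pi_\theta$; the data layout, the function names, and the triangle instance are hypothetical.

```python
import random
from itertools import product

def contract(J, u, v, sigma):
    """Impose s_u = sigma * s_v; return reduced couplings and the constant term."""
    new_J, const = {}, 0.0
    for (a, b), w in J.items():
        if u not in (a, b):
            key, val = (a, b), w
        else:
            other = b if a == u else a
            if other == v:
                const += sigma * w
                continue
            key, val = tuple(sorted((v, other))), sigma * w
        new_J[key] = new_J.get(key, 0.0) + val
    return new_J, const

def run_episode(J, n_c, rng):
    """Roll out one episode; returns the list of actions and the delayed reward."""
    J = dict(J)
    const, trajectory = 0.0, []
    nodes = {x for edge in J for x in edge}
    while len(nodes) > n_c and J:
        # Placeholder policy: uniform over edges and signs (not a trained pi_theta).
        (u, v), sigma = rng.choice(sorted(J)), rng.choice((-1, 1))
        trajectory.append(((u, v), sigma))
        J, c = contract(J, u, v, sigma)
        const += c
        nodes.discard(u)
    # Delayed reward: best energy still reachable from the final state.
    order = sorted(nodes)
    best = max(
        sum(w * s[order.index(a)] * s[order.index(b)] for (a, b), w in J.items())
        for s in product((-1, 1), repeat=len(order))
    )
    return trajectory, const + best

# Hypothetical frustrated triangle whose maximum energy is 1.
J0 = {(0, 1): 1.0, (1, 2): -1.0, (0, 2): 1.0}
trajectory, reward = run_episode(J0, n_c=1, rng=random.Random(0))
```

Because the reward depends on the whole sequence of contractions, an early wrong sign cannot be repaired later in the same episode, which is the situation a trainable policy is meant to learn to avoid.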
Definition 1 (Policy of RL-RQAOA) Given a depth-$l$ QAOA ansatz acting on $n$ qubits, defined by a Hamiltonian $H_n$ (with an underlying graph $G_n = (V, E)$), let $M_{u,v} = \langle \Psi_l(\vec{\alpha}, \vec{\gamma}) | Z_u Z_v | \Psi_l(\vec{\alpha}, \vec{\gamma}) \rangle$ be the two-correlations that it generates. We define the policy of RL-RQAOA as

$$\pi_\theta\big(a = (u, v) \mid s = G_n\big) = \frac{\exp\big(\beta_{u,v} |M_{u,v}|\big)}{\sum_{(u',v') \in E} \exp\big(\beta_{u',v'} |M_{u',v'}|\big)},$$

where actions $a$ correspond to edges $(u, v) \in E(G_n)$, states $s$ to graphs $G_n$, and $\beta_{u,v} \in \mathbb{R}$ (which exists for every possible edge) is an inverse temperature parameter. Here, $\theta = (\vec{\alpha}, \vec{\gamma}, \vec{\beta})$ constitutes all trainable parameters.

The reader is referred to Alg. 2 for the pseudo-code of RL-RQAOA (for one episode), where the added RL components are highlighted in green. Furthermore, we note that RL-RQAOA is a generalized version of RQAOA: the former is exactly equivalent to the latter when the energy-optimal parameters $\{\vec{\alpha}, \vec{\gamma}\}$ are specified by QAOA on $H_n$ and $\beta_{u,v} = \infty$ for all $(u, v) \in E$. Since the vector $\vec{\beta}$ is edge-specific, and since we learn $\beta_{u,v}$ for $\{u, v\} \in E$ separately for every instance, we also develop a fully classical RL algorithm, namely RL-RONE, which simply learns $\beta_{u,v}$ for all edges directly, irrespective of where the two-correlation coefficients $M_{u,v}$ come from. It is natural to consider this because, in the hybrid quantum-classical arrangement of RL-RQAOA, the classical part (learning $\beta_{u,v}$ for $\{u, v\} \in E$) might be more powerful than the quantum part (computing the two-correlations $M_{u,v}$ for $\{u, v\} \in E$ from the QAOA ansatz at given variational angles $\{\vec{\alpha}, \vec{\gamma}\}$). Hence, in order to assess the contribution of the quantum circuit in RL-RQAOA, we define the policy of RL-RONE such that for each edge we fix the two-correlation $M_{u,v} = 1$; i.e., we do not use any output from the quantum circuit. However, when simply using $M_{u,v} = 1$ in Def. 1, the policy selects an edge and always assigns it to be correlated, which renders it less expressive. A solution to this problem is to simultaneously learn the parameters $\beta^{+1}_{u,v}$ (correlated edge) and $\beta^{-1}_{u,v}$ (anti-correlated edge) for each edge. The resulting RL-RONE algorithm is then expressive enough to reach the optimum solution. Moreover, it has trainable inverse temperature parameters $\vec{\beta}$ with $|\vec{\beta}| = n^2 - n$, where $n$ is the number of nodes of the graph $G_n$. The notion of an action differs slightly from the RL-RQAOA policy: here an action corresponds to selecting an edge along with its sign (+1 and −1 for correlated and anti-correlated edges, respectively), while in RL-RQAOA the two-correlation coefficient implicitly selects this sign. We formally define the policy of RL-RONE below.
Definition 2 (Policy of RL-RONE) Given a Hamiltonian $H_n$ (with an underlying graph $G_n = (V, E)$), we define the policy of RL-RONE as

$$\pi_\theta\big(a = ((u, v), b) \mid s = G_n\big) = \frac{\exp\big(\beta^{b}_{u,v}\big)}{\sum_{(u',v') \in E} \sum_{b' \in \{\pm 1\}} \exp\big(\beta^{b'}_{u',v'}\big)},$$

where actions $a$ correspond to edges $(u, v) \in E(G_n)$ along with an edge correlation $b \in \{\pm 1\}$, states $s$ correspond to graphs $G_n$, and $\beta^{\pm 1}_{u,v} \in \mathbb{R}$ (which exist for every edge) are inverse temperature parameters. Here, $\theta = (\vec{\beta}^{+1}, \vec{\beta}^{-1})$ constitutes all trainable parameters.

The classical analogue RL-RONE can then be simulated by making the following modifications to Alg. 2: (i) replace the policy parameters by $\theta = (\vec{\beta}^{+1}, \vec{\beta}^{-1})$; (ii) delete Lines 4 and 5; (iii) update Line 6 to use the policy of RL-RONE, and impose the constraint (6) in Line 7 by feeding in the correlation sign of the edge from the output ($b \in \{\pm 1\}$) of the policy of RL-RONE.
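The two policies described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the RL-RQAOA policy is taken to be a softmax over the logits $\beta_{u,v}|M_{u,v}|$ with the sign read off from $M_{u,v}$, and the RL-RONE policy a softmax over the $2|E|$ logits $\beta^{\pm 1}_{u,v}$. The function names and the dictionary-based data layout are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    weights = np.exp(shifted)
    return weights / weights.sum()

def rl_rqaoa_policy(edges, corr, beta):
    """Assumed RL-RQAOA policy: probability of edge e proportional to exp(beta_e * |M_e|).

    edges: list of (u, v) tuples; corr, beta: dicts keyed by those tuples.
    """
    logits = np.array([beta[e] * abs(corr[e]) for e in edges])
    return softmax(logits)

def sample_rl_rqaoa_action(edges, corr, beta, rng):
    """Sample an edge; the sign to impose is the sign of its two-correlation."""
    probs = rl_rqaoa_policy(edges, corr, beta)
    k = rng.choice(len(edges), p=probs)
    edge = edges[k]
    sign = 1 if corr[edge] >= 0 else -1
    return edge, sign

def rl_rone_policy(edges, beta_plus, beta_minus):
    """Assumed RL-RONE policy over (edge, sign) pairs, with M fixed to 1.

    The first len(edges) entries correspond to sign +1, the rest to sign -1.
    """
    logits = np.array([beta_plus[e] for e in edges] + [beta_minus[e] for e in edges])
    return softmax(logits)
```

With large inverse temperatures the RL-RQAOA policy concentrates on the edge with the largest $|M_{u,v}|$, which mirrors the argmax rule of RQAOA; with all $\beta$ equal to zero both policies are uniform.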
We train both the policies of RL-RQAOA and RL-RONE using the Monte Carlo policy gradient algorithm REINFORCE, as explained in Appendix B; see also Alg. 1 for the pseudo-code. The horizon (length) of an episode is $n - n_c$. The value function is defined in terms of a discount factor $\gamma \in [0, 1]$, the Hamiltonian $H_n$ defined on $n$ variables for a problem instance, and the binary bitstring $x$ defined in Line 14 of Alg. 2.
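To make the training step more concrete, the following is a minimal sketch of one REINFORCE update of the inverse temperatures. It assumes a softmax policy with logits $\beta_e |M_e|$, discounted Monte Carlo returns, and a plain gradient-ascent step; the episode layout (`edge_ids`, `abs_corr`, `action`, `reward`), the learning rate, and all function names are hypothetical. The update of the QAOA angles is omitted.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def grad_log_softmax_beta(abs_corr, beta, action):
    """Gradient of log pi(action) w.r.t. beta for logits beta * abs_corr."""
    logits = beta * abs_corr
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    onehot = np.zeros_like(probs)
    onehot[action] = 1.0
    return abs_corr * (onehot - probs)

def reinforce_update(beta, episode, gamma=0.99, lr=0.1):
    """One gradient-ascent step on beta from a single recorded episode.

    Each step is a dict with:
      edge_ids: indices into beta of the edges available at that step (unique),
      abs_corr: |M| for those edges, action: position of the chosen edge,
      reward:   reward received after the step (often 0 until the last step).
    """
    returns = discounted_returns([step["reward"] for step in episode], gamma)
    grad = np.zeros_like(beta)
    for t, step in enumerate(episode):
        idx = step["edge_ids"]
        g = grad_log_softmax_beta(step["abs_corr"], beta[idx], step["action"])
        grad[idx] += returns[t] * g
    return beta + lr * grad
```

A positive return raises the inverse temperature of the chosen edge and lowers those of the competing edges, so repeated updates push the policy toward branches that led to good energies.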
In this work, we focus only on simulations of depth-1 RQAOA and RL-RQAOA. Indeed, in the particular case of depth-1 quantum circuits and Ising Hamiltonians, RQAOA can be simulated efficiently classically; see Sec. II B and Appendix A. However, classical simulability is not known for Ising cost functions at depth larger than 2 [6], nor for more general cost Hamiltonians even at depth 1 (e.g., Max-k-XOR on arbitrary hypergraphs), leaving room for both quantum and RL-enhanced quantum advantage.
VI. NUMERICAL ADVANTAGE OF RL-RQAOA OVER RQAOA
In the previous section, we introduced both our quantum (inspired) policy of RL-RQAOA and an entirely classical policy of RL-RONE, together with their design choices; based on these, we proposed an RL-enhanced RQAOA and its classical analogue RL-RONE. Although we gave justifications for these choices, it is natural to evaluate their influence on the performance of RL-RQAOA and RL-RONE. In this section, we first describe how we found hard instances for RQAOA and discuss their properties. We then describe the results of our numerical simulations, where we consider both hard instances and random instances to benchmark the performance of (depth-1) RQAOA, RL-RQAOA, and RL-RONE. The reader is referred to Appendix C for implementation details of the above algorithms.

[FIG. 4 plot. x-axis: Cage Graph Instances (d3-g6, d3-g7, d3-g8, d3-g9, d3-g10, d3-g11, d3-g12, d4-g5, d4-g6, d4-g7, d4-g8, d5-g5).]

A. Hard Instances for RQAOA
Here, our focus is on finding small-size hard instances (with the approximation ratio as a metric) of the Ising problem where RQAOA fails. Note that we assume RQAOA must fail to solve some instances exactly: otherwise, NP ⊆ BQP would follow, since the Ising problem is NP-hard in general. As we lack techniques to analyze the performance guarantees of RQAOA at arbitrary depth $l$, apart from special cases like the "ring of disagrees" at depth 1 [7], it is a non-trivial task to find hard instances for RQAOA. In this spirit, we generate an ensemble $G[n, d, w]$ of weighted random $d$-regular instances with $n$ vertices and edge weight distribution $w : E \to \mathbb{R}$. We then perform a random search over $G[n, d, w]$ to find hard instances. Concretely, we construct the graph ensemble $G[n, d, w]$ as follows: for each tuple of parameters $(n, d, w) \in \{14, 15, \ldots, 30\} \times \{3, 4, \ldots, 29\} \times \{\text{Gaussian}, \text{bimodal}\}$, we generate 25 graphs whenever possible⁸, yielding 10600 graphs in total, where Gaussian ($N(0, 1)$) and bimodal ($\{\pm 1\}$) are the edge weight distributions. Intuitively, the instances with bimodal edge weights should have a high level of degeneracy within the ground states, which is confirmed by our simulations. Moreover, for the instances with bimodal edge weights, where ties between two-correlation coefficients were encountered, the final approximation ratio was computed from the best energy attained over a maximum of 1400 independent runs of RQAOA. On the other hand, for the instances with Gaussian edge weights $N(0, 1)$, we found that all instances had unique ground states. Hence, we ran RQAOA only once to get the best approximation ratio for instances with Gaussian edge weights.
We filter out 1027 (857 with bimodal weights and 170 with Gaussian weights) instances for which RQAOA's approximation ratio is less than 0.95.Note that RQAOA can only be closer to optimal the larger n c is.In other words, it monotonically improves the quality of the solution with an increase in n c .Since we want to improve upon RQAOA in its strongest regime, we choose n c = 8 (unless specified otherwise) for our numerical simulations.However, interestingly for the 1027 hard instances found above, even with n c = 4, we only found 26 instances (5 with bimodal weights and 21 with Gaussian weights) for which the approximation ratio decreased (for the rest, the approximation ratio remained the same).We chose n c = 4 for the previously mentioned experiment because, for some instances, the edge weights cancelled out after an edge contraction subroutine, and as a consequence, the intermediate graph ended up being an empty graph (a graph with zero edge weights) for 1 ≤ n c < 4.
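The cancellation of edge weights mentioned above can be illustrated with a small sketch of a single elimination step. It assumes the standard RQAOA substitution $Z_v = \sigma Z_u$ with $\sigma \in \{\pm 1\}$ applied to a cost of the form $\sum_{(i,j)} J_{ij} z_i z_j$; the function name `eliminate_variable` and the `frozenset`-keyed weight dictionary are hypothetical choices made for this example.

```python
def eliminate_variable(weights, u, v, sign):
    """Substitute z_v = sign * z_u in H = sum over edges of J_ij * z_i * z_j.

    weights: dict mapping frozenset({i, j}) -> J_ij.
    Returns (new_weights, constant), where new_weights no longer contains v
    and constant is the energy offset produced by the eliminated edge (u, v).
    """
    new_weights = {}
    constant = 0.0
    for edge, coupling in weights.items():
        if v not in edge:
            # Edge untouched by the substitution (may still merge with a rewired edge).
            new_weights[edge] = new_weights.get(edge, 0.0) + coupling
            continue
        (w,) = edge - {v}
        if w == u:
            # z_u * z_v = sign * z_u^2 = sign, a constant energy shift.
            constant += sign * coupling
        else:
            # Edge (v, w) is rewired to (u, w) with weight sign * J_vw.
            key = frozenset({u, w})
            new_weights[key] = new_weights.get(key, 0.0) + sign * coupling
    # Drop edges whose weights cancelled exactly (exact for integer-valued weights).
    new_weights = {e: c for e, c in new_weights.items() if c != 0.0}
    return new_weights, constant
```

For example, on a triangle with weights $J_{01} = J_{12} = 1$ and $J_{02} = -1$, imposing $z_1 = z_0$ rewires the edge (1, 2) onto (0, 2) with weight +1, which cancels the existing −1 and leaves an empty graph with a constant offset of 1.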

RQAOA vs RL-RQAOA on Cage Graphs
In our first set of experiments, illustrated in Fig. 4, we compare the performance of RL-RQAOA with RQAOA on random Ising instances derived from (d, g)-cage graphs (3 ≤ d ≤ 7; 5 ≤ g ≤ 12; edge weights {−1, +1}). The aim of this experiment is twofold: first, to show that RL-RQAOA does not perform much worse than RQAOA on instances where the latter performs quite well; second, to test the advantage of RL-RQAOA over RQAOA in terms of the probability of attaining the optimal solution when there are many ties between two-correlation coefficients $M_{u,v}$ at every iteration. Notably, we already demonstrated earlier (see Fig. 3) that for cage graphs, RQAOA has a constant number of ties between maximal two-correlation coefficients for the majority of the $n - n_c$ iterations. To assess our hypotheses, we evaluate the average learning performance over 15 independent RL-RQAOA runs of 1400 episodes each. In order to compare RL-RQAOA fairly with RQAOA, we run RQAOA independently 1400 times and choose the best solution from the results of these runs. Note that this is a more powerful heuristic than vanilla RQAOA (which outputs the first solution it finds), with the number of independent runs acting as a hyperparameter that controls the solution quality. Both RL-RQAOA (vote variant) and RQAOA fail to reach the optimum for the (3, 12)-cage graph within the given budget (see Fig. 4). However, by evaluating the resulting learning curves of RL-RQAOA, both our hypotheses can be confirmed for most of the instances.

RQAOA vs RL-RQAOA on hard instances
For the next set of experiments, presented in Fig. 5, the setup is similar to the previous experiment, but the aim is to show a separation between RL-RQAOA and RQAOA on the hard instances found in Sec. VI A. More specifically, we show that RL-RQAOA always performs better than RQAOA on these instances in terms of the best approximation ratio achieved. To assess this claim, we evaluate the average learning performance over 15 independent RL-RQAOA runs. Interestingly, RL-RQAOA outperformed RQAOA even when the angles of the QAOA circuit were initialized randomly.

FIG. 6 (caption excerpt). We chose $n_c = 10$ and $n_c = 18$ for 100 and 200 nodes, respectively, in our simulations, and the parameters $\theta = (\alpha, \gamma, \vec{\beta})$ of the RL-RQAOA policy were initialized by setting $\vec{\beta} = \{25\}^{(n^2 - n)/2}$ and the angles $\{\alpha, \gamma\}$ (at every iteration) to energy-optimal angles (i.e., by following one run of RQAOA). All agents were trained using REINFORCE (Alg. 1).

RL-RQAOA vs RL-RONE
However, the results in the previous two subsections do not indicate the importance of the quantum part in the quantum-classical arrangement. To address this, we performed a third set of experiments, presented in Fig. 6, where both algorithms were tested on random 3-regular graphs of 100 and 200 nodes. By comparing the performance of RL-RONE with RL-RQAOA, we can see a clear separation between the learning curves of the agents of these algorithms, highlighting the effectiveness of the quantum circuits in solving the Ising problem.

VII. DISCUSSION AND CONCLUSION
In this work, we analyzed the bottlenecks of a non-local variant of QAOA, namely recursive QAOA (RQAOA), and, based on this analysis, proposed a novel algorithm that uses reinforcement learning (RL) to enhance the performance of RQAOA (RL-RQAOA). In the process of analyzing the bottlenecks of RQAOA for the Ising problem, we found small-size hard Ising instances in a graph ensemble of random weighted d-regular graphs. To avoid missing better solutions at every iteration, we cast the variable elimination problem within RQAOA in a reinforcement learning framework; we introduced a quantum (inspired) policy of RL-RQAOA, which controls the switching between exploitative and exploratory behaviour of RL-RQAOA. We demonstrated via numerical simulations that formulating RQAOA in the RL framework boosts its performance: RL-RQAOA performs as well as RQAOA on random instances and beats RQAOA on all hard instances we have identified. Finally, we note that all the numerical simulations for RQAOA (depth-1) and the proposed hybrid algorithm RL-RQAOA (depth-1) were performed classically, and no quantum advantage is to be expected unless both are run at higher depths. An interesting follow-up to this work would be to assess the performance of both RQAOA and RL-RQAOA at higher depths on an actual quantum processing unit (QPU) in both noisy and noise-free regimes.
the Dutch Research Council (NWO/OCW), as part of the Quantum Software Consortium programme (project number 024.003.037).VD acknowledges the support by the project NEASQC funded from the European Union's Horizon 2020 research and innovation programme (grant agreement No 951821).VD also acknowledges support through an unrestricted gift from Google Quantum AI.

AUTHOR CONTRIBUTIONS
YJP and SJ contributed equally to this work. YJP, SJ, and VD designed all the experiments. The manuscript was written with contributions from all authors. All authors read and approved the final manuscript.

This theorem enables the gradient of the value function to be estimated analytically, with a sample complexity that scales only logarithmically in the number of parameters θ of the policy [21].
Moreover, one can estimate the value function terms $V_{\pi_\theta}(s_t)$ in (B2) by collecting rewards from the Monte Carlo rollouts (as defined in (B1)). This learning algorithm is called the Monte Carlo policy gradient algorithm, otherwise known as REINFORCE [36,39]. In the literature, there exist other, more sophisticated approaches, such as the actor-critic method [25], where the value function is estimated using an additional approximator such as a deep neural network (DNN).
EXACT: The exact solutions were computed using the state-of-the-art commercial solver Gurobi 9.0 with variable

A. Quantum Approximate Optimization Algorithm
QAOA seeks to approximate the maximum of the binary cost function $C : \{0, 1\}^n \to \mathbb{R}$ encoded into a Hamiltonian as $H_n = \sum_{x \in \{0,1\}^n} C(x) |x\rangle\langle x|$. Starting from an initial state $|s\rangle = |+\rangle^{\otimes n}$ (the uniform superposition state), QAOA alternates between two unitary evolution operators, $U_p(\gamma) = \exp(-i\gamma H_n)$ (phase operator) and $U_m(\alpha) = \exp(-i\alpha H_b)$ (mixer operator), where $H_b = \sum_{j=1}^{n} X_j$. Hereafter, $X, Y, Z$ are the standard Pauli operators and $P_j$ is a Pauli operator acting on qubit $j$ for $P \in \{X, Y, Z\}$. The phase and mixer operators are typically applied a total of $l$ times, generating the quantum state

$$|\Psi_l(\vec{\alpha}, \vec{\gamma})\rangle = U_m(\alpha_l) U_p(\gamma_l) \cdots U_m(\alpha_1) U_p(\gamma_1) |s\rangle.$$
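For small instances, the depth-1 state and its two-correlations can be computed by a brute-force statevector simulation. The sketch below assumes an Ising cost $C(x) = \sum_{(u,v)} J_{uv} z_u z_v$ with $z_q = 1 - 2x_q$ and the operators $U_p(\gamma)$ and $U_m(\alpha)$ as defined above; the function name `qaoa_depth1_correlations` is hypothetical, and the memory cost grows as $2^n$, so this is not the efficient classical simulation referred to in the paper.

```python
import numpy as np

def qaoa_depth1_correlations(n, weights, alpha, gamma):
    """Return {(u, v): <Z_u Z_v>} on the depth-1 QAOA state for an Ising cost.

    weights: dict mapping (u, v) -> J_uv with 0 <= u, v < n.
    """
    shape = (2,) * n
    bits = np.indices(shape)        # bits[q] holds the value of bit q for each basis state
    z = 1 - 2 * bits                # Z_q eigenvalue: +1 for bit 0, -1 for bit 1

    # Diagonal of the cost Hamiltonian H_n in the computational basis.
    cost = np.zeros(shape)
    for (u, v), coupling in weights.items():
        cost = cost + coupling * z[u] * z[v]

    # Uniform superposition, followed by the phase operator exp(-i * gamma * H_n).
    psi = np.full(shape, 2.0 ** (-n / 2), dtype=complex)
    psi = np.exp(-1j * gamma * cost) * psi

    # Mixer exp(-i * alpha * X_q) = cos(alpha) I - i sin(alpha) X_q on every qubit;
    # X_q swaps the two entries along axis q.
    for q in range(n):
        psi = np.cos(alpha) * psi - 1j * np.sin(alpha) * np.flip(psi, axis=q)

    prob = np.abs(psi) ** 2
    return {(u, v): float(np.sum(prob * z[u] * z[v])) for (u, v) in weights}
```

For a single edge with $J = 1$ on two qubits, a short calculation under these conventions gives $\langle Z_0 Z_1 \rangle = \sin(4\alpha)\sin(2\gamma)$, which provides a convenient sanity check for the simulation.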

Algorithm 1: REINFORCE algorithm for the policies of RL-RQAOA and RL-RONE
Input: The policy of RL-RQAOA (Def. 1) or RL-RONE (Def. 2)

FIG. 4. Comparison of the success probability of RL-RQAOA and RQAOA in attaining ground state solutions on cage graphs. The x-axis depicts the properties of the cage graph(s); for instance, d3-g6 denotes that the instance is 3-regular with girth (length of the shortest cycle) 6. Error bars appear only for a few instances (specifically d3-g9, d3-g10 and d5-g5) because multiple graph instances exist with the same properties (degree and girth). RL-RQAOA was evaluated by averaging the learning performance over 15 independent runs, while for RQAOA the best energy is taken given a fixed budget of 1400 runs. The probability for RL-RQAOA-max is computed by taking the maximum energy attained by the agent over all 15 independent runs for a particular episode. On the other hand, the probability for RL-RQAOA-vote (statistically more significant) is computed by aggregating the maximum energy attained for a particular episode only if more than 50% of the runs agree. We chose $n_c = 8$ for instances with at most 50 nodes and $n_c = 10$ otherwise. The parameters $\theta = (\alpha, \gamma, \vec{\beta})$ of the RL-RQAOA policy were initialized by setting $\vec{\beta} = \{25\}^{(n^2 - n)/2}$ and the angles $\{\alpha, \gamma\}$ (at every iteration) to energy-optimal angles (i.e., by following one run of RQAOA). All agents were trained using REINFORCE (Alg. 1).
FIG. 5. Numerical evidence of the advantage of RL-RQAOA over RQAOA in terms of approximation ratio on hard instances. The box plot is generated by taking the mean of the best approximation ratio over 15 independent runs of 1400 episodes for RL-RQAOA. RL-RQAOA clearly outperforms RQAOA in terms of approximation ratio for the instances considered (these are exactly the instances where RQAOA's approximation ratio is ≤ 0.95). We chose $n_c = 8$ in our simulations; the parameters $\theta = (\alpha, \gamma, \vec{\beta})$ of the RL-RQAOA policy were initialized by setting $\vec{\beta} = \{25\}^{(n^2 - n)/2}$, and the angles $\{\alpha, \gamma\}$ (at every iteration) were initialized randomly. All agents were trained using REINFORCE (Alg. 1).