Robustness of quantum reinforcement learning under hardware errors
EPJ Quantum Technology volume 10, Article number: 8 (2023)
Abstract
Variational quantum machine learning algorithms have become the focus of recent research on how to utilize near-term quantum devices for machine learning tasks. They are considered suitable for this as the circuits that are run can be tailored to the device, and a big part of the computation is delegated to the classical optimizer. It has also been hypothesized that they may be more robust to hardware noise than conventional algorithms due to their hybrid nature. However, the effect of training quantum machine learning models under the influence of hardware-induced noise has not yet been extensively studied. In this work, we address this question for a specific type of learning, namely variational reinforcement learning, by studying its performance in the presence of various noise sources: shot noise, coherent and incoherent errors. We analytically and empirically investigate how the presence of noise during training and evaluation of variational quantum reinforcement learning algorithms affects the performance of the agents and the robustness of the learned policies. Furthermore, we provide a method to reduce the number of measurements required to train Q-learning agents, using the inherent structure of the algorithm.
1 Introduction
Quantum machine learning (QML) is advertised as one of the most promising candidates for a near-term advantage in quantum computing [1]. The variational quantum algorithms (VQAs) that are used for this are trained in a hybrid fashion, where a classical optimizer is used to tune the parameters of a quantum circuit [2, 3]. It is hypothesized that the hybrid training scheme, along with the freedom of adjusting the parameters appropriately, makes these algorithms inherently robust to quantum hardware noise to some extent [2, 4]. This hypothesis is also inspired by classical neural networks, which are robust under certain types of noise. In the classical setting, one can broadly distinguish between two types of noise: benign noise that does not severely impact the training procedure or can even improve generalization [5–8], and adversarial noise which is deliberately constructed to study where neural networks fail [9–12]. Furthermore, we can distinguish between noise that is present during training, and noise that is present when using the trained model. Adversarial noise is usually of the latter kind, where a trained neural network can produce completely wrong outputs due to small perturbations of the input data [13]. The benign type of noise mentioned above, on the other hand, is usually present at training time in the form of perturbations of the input data, activation functions, weights or structure of the neural network, and has even been established as a method to combat overfitting in the classical literature [5–8, 14]. These results inspired the hypothesis that variational quantum algorithms possess a similar robustness to certain types of noise and may even benefit from its presence when trained on a quantum device. However, thorough investigations that confirm such robustness of VQAs against hardware-related noise, or even a beneficial effect from it, are still lacking.
In terms of negative results for the trainability of VQAs under noise, it has been shown that optimization landscapes of noisy quantum circuits become increasingly flat at a rate that scales exponentially with the number of qubits under local Pauli noise when the circuit depth grows linearly with the number of qubits [15]. In the case of the variational quantum eigensolver, where the goal is to find the ground state of a given Hamiltonian, the presence of noise has been shown to lead to increasing deviation from the ideal energy [16]. Similar effects have been studied in the context of the quantum approximate optimization algorithm (QAOA) [17], where the goal is to find the ground state of a Hamiltonian that represents the solution to a combinatorial optimization problem [18, 19].
When it comes to QML, in-depth studies on the effect of noise on the trainability and performance of VQAs are scarce. Apart from the work mentioned above on noise-induced barren plateaus [15], the authors of [20] provided first insights into how the data encoding method used in a quantum classifier influences its resilience to varying types of noise. As for the potential benefit of noise, the authors of [21] show that the stochasticity induced by measurements in a QML model can help the optimizer to escape saddle points. The above results show that, on the one hand, too much noise will make the model untrainable, while on the other hand, modest amounts of noise can even improve trainability [21]. However, it remains unclear how large the gap is between tolerable and harmful amounts of noise [4], and it is not expected that this can be answered in a general way for all different types of learning algorithms and noise sources.
In this work, we shed light on this question from the angle of variational quantum reinforcement learning (QRL). Classical reinforcement learning (RL) models have been shown to be sensitive to noise, either during training [22] or in the form of adversarial samples [23, 24]. Additionally, it is known that a bottleneck of RL algorithms is their sample inefficiency, i.e., many interactions with an environment are needed for training [25]. Still, RL resembles human-type learning most closely among the main branches of modern ML, and therefore motivates further studies in this area. Among these studies, RL with VQAs has been proposed and extensively investigated in the noise-free setting over the past few years [26–34]. These results provide promising perspectives, as quantum models have empirically been shown to perform similarly to neural networks on small classical benchmark tasks [29, 32], while at the same time an exponential separation between classical and quantum learners can be proven for specific contrived environments based on classically hard tasks [28, 29]. These results motivate further studies on how large the above-mentioned gap between tolerable and too much noise is in the case of variational RL algorithms, and how close the algorithm performance can get to the noise-free setting for various types of noise that can be present on near-term devices.
We investigate this for two types of variational RL algorithms, Q-learning and the policy gradient method, by performing extensive numerical experiments for both types of algorithms with two different environments, CartPole and the Travelling Salesperson Problem, and under the effect of a wide class of noise sources, namely shot noise, coherent and incoherent errors. In Fig. 1 we summarise the approach of the present work, showing the QRL models, environments and noise sources considered in the analysis. We start by considering the trade-off between the number of measurement shots taken for each circuit evaluation and the performance of variational agents. As the number of shots required by a QML algorithm can be a bottleneck on near-term devices and RL is known to require many interactions with the environment to learn, we propose a method for Q-learning to reduce the number of overall measurements by taking advantage of the structure of the underlying RL algorithm. Second, we model coherent errors with a random Gaussian perturbation of the variational parameters, and analytically study the effect of these perturbations on the output of parameterised quantum circuits, similarly to [35]. We provide an upper bound on the perturbation induced by such Gaussian coherent noise based on the Hessian matrix of the circuit, and theoretically and numerically show that hardware-efficient ansätze may be particularly resilient against this type of error due to small second derivatives [36]. Finally, we analyse the performance of the above algorithms under the action of incoherent errors coming from the unavoidable interaction of the qubits with the environment, over which we have no control. To study this type of noise, we start by investigating the effect of single-qubit depolarization channels.
In addition, we consider a custom noise model that combines various types of errors present on hardware, and study the effect of this noise model with error probabilities that are present in currently available superconducting quantum hardware. Our results show that both policy gradient methods and Q-learning exhibit a robustness to noise that may enable successfully running them on near-term devices. This motivates further study in the quest to find a real-world problem of interest where a quantum advantage for variational RL could be possible.
2 Reinforcement learning
In this section, we will provide a brief introduction to RL that contains the basics necessary to understand this work. For a more in-depth introduction to the topic we refer the reader to [37].
In RL, an agent learns to perform a specific task by trial and error through interacting with an environment. In contrast to supervised learning, this means that there is no necessity for a preexisting training dataset made of pairs of inputs and corresponding correct labels. Instead, the learning task is specified in terms of an environment and a reward function. The environment is defined in terms of its state space \(\mathcal{S}\) and its action space \(\mathcal{A}\), as well as a transition function \(P^{a}_{ss'} = P(s'|s, a)\) that specifies the probability of transitioning to state \(s'\), given that the previous state is s and action a is taken. The agent can use the actions \(a \in \mathcal{A}\) to move across states \(s \in \mathcal{S}\) of the environment, and receives a reward r that informs it about the quality of the chosen action. The agent chooses its actions based on a policy \(\pi (a|s)\) which specifies the probability of taking actions given states, and its goal is to maximize the rewards. This is formally defined as a quantity called the expected return, which is the random variable \(G_{t}\),
\[G_{t} = \sum_{k=0}^{\infty} \gamma ^{k} r_{t+k+1}, \tag{1}\]
where \(\gamma \in [0, 1)\) is a discount factor that controls the significance of delayed rewards, t is the current time step and \(r_{t}\) represents the reward at the given time step. Typically we work in episodic environments with a fixed time horizon H, so that the sum in Equation (1) runs until H instead of infinity. We can then quantify the agent’s performance in terms of a value function,
\[V_{\pi}(s) = \mathbb{E}_{\pi} [G_{t} \mid s_{t} = s ], \tag{2}\]
which is the expected return when following a given policy π from an initial state s. There are many different approaches to maximize the expected return, and we focus on the two main paradigms used in state-of-the-art RL: value-based and policy gradient methods. We will now introduce both of these in more detail.
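As a small illustration of the discounted return for an episode with a finite horizon, the following sketch computes the return from the first time step and the return-to-go from every time step. The function names are hypothetical and chosen only for this example; the list `rewards` is assumed to hold the rewards in the order in which they were received.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence, sum_k gamma**k * rewards[k]."""
    g = 0.0
    # Accumulate backwards so that each step is g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def returns_to_go(rewards, gamma):
    """Discounted return from every time step of a finite episode."""
    out = [0.0] * len(rewards)
    g = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


# Three steps with reward 1 each and a discount factor of 0.5.
print(discounted_return([1.0, 1.0, 1.0], 0.5))
print(returns_to_go([1.0, 1.0, 1.0], 0.5))
```

A smaller discount factor makes the agent weight immediate rewards more heavily, while values close to one make delayed rewards count almost as much as immediate ones.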
2.1 Value-based methods
One approach to maximizing the expected return is to parameterize and train the value function in Equation (2) directly with a function approximator. This function approximator can be implemented for example as a neural network (NN) [38] or a parameterised quantum circuit (PQC) [26, 27, 29]. The value-based method that we focus on in this work is called Q-learning. While the value function in Equation (2) is called the state-value function as it only depends on the state, in Q-learning we try to approximate the action-value function that additionally depends on the action,
\[Q_{\pi}(s, a) = \mathbb{E}_{\pi} [G_{t} \mid s_{t} = s, a_{t} = a ]. \tag{3}\]
For a parametrized Q-function \(Q_{\pi}(s, a; \boldsymbol {\theta})\) the goal is then to approximate the optimal Q-function \(Q^{*}\) as closely as possible, where the optimal Q-function is the one that leads to the optimal policy. The actions are chosen such that in each time step, the agent prefers to take the action that has the highest expected return, i.e.,
\[a_{t} = \operatorname{argmax}_{a} Q(s_{t}, a; \boldsymbol {\theta}). \tag{4}\]
Due to this choice being deterministic, a Q-learning agent may never visit certain states of the environment and therefore not explore the state space sufficiently to find a good policy. In order to facilitate exploration, in practice a so-called ϵ-greedy policy is used, where the agent selects a random action instead of that corresponding to the largest Q-value with probability ϵ. Typically, ϵ is chosen large at the beginning and decreased over the course of training. In each training step, the Q-values are updated as follows,
In order to train a function approximator like a NN or a PQC, the right-hand side of Equation (5) is used as a label in a supervised learning setting. This means that the function approximator is updated based on its own predictions about the expected return under the current parametrization, in addition to the reward given by the environment. Consequently, the agent needs to learn a moving target, which can lead to instability of training and delayed convergence. Additionally, updates are always based on the latest observed rewards, so the agent can “forget” previously learned behaviour even when it was beneficial.
To stabilize training, two components have been added to the algorithm. The first is a second model to compute the Q-values on the right-hand side of Equation (5), called the target model, which is identical to the Q-function approximator but with parameters that are updated with a copy of the Q-function approximator’s parameters only at fixed intervals. This decreases the rate of change in the prediction of the expected return used for parameter updates, and can therefore make learning more stable. The second is a memory in which past interactions with the environment are stored and from which they are randomly sampled to perform parameter updates, which removes temporal correlations between transitions. For more detail on Q-learning with function approximators, also referred to as deep Q-learning in classical literature, we refer the reader to the seminal work [38].
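The two ingredients described above, ϵ-greedy action selection and labels computed with a separate target model, can be sketched as follows. This is a minimal illustration of the standard deep Q-learning recipe and not the implementation used in this work. The function names are hypothetical, `next_q_target` is assumed to hold the target model's Q-values for the next states of a batch sampled from the memory, and the treatment of terminal transitions (no bootstrapping) is an assumption of this sketch.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def td_targets(rewards, next_q_target, dones, gamma):
    """Labels r + gamma * max_a Q_target(s', a) for a batch of transitions.

    Terminal transitions (dones == True) use the reward alone as the label.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_target = np.asarray(next_q_target, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    return rewards + gamma * (~dones) * next_q_target.max(axis=1)


rng = np.random.default_rng(seed=0)
action = epsilon_greedy(np.array([0.1, 0.9]), epsilon=0.1, rng=rng)
labels = td_targets(
    rewards=[1.0, 1.0],
    next_q_target=[[0.5, 2.0], [3.0, 1.0]],
    dones=[False, True],
    gamma=0.9,
)
print(action, labels)
```

Because the labels depend only on the target model, they change only when its parameters are refreshed, which is what slows down the moving-target effect discussed above.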
2.2 Policy gradient method
As described above, an RL agent chooses its actions based on a policy \(\pi (a|s)\), which is the conditional probability distribution of actions given states. To maximize the expected return, the agent needs to find the optimal policy \(\pi ^{*}\). In policy gradient training, the agent is implemented in the form of a parametrized policy \(\pi _{\boldsymbol {\theta}}\), and the goal of the algorithm is to find the parameters \(\boldsymbol {\theta}^{*}\) that produce the optimal policy. The quality of the policy is measured by a quantity \(J(\boldsymbol {\theta})\), that in the fixed-horizon setting is equal to the value function (2),
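In a gradient-based optimization procedure the parameters are updated according to
\[\boldsymbol {\theta} \leftarrow \boldsymbol {\theta} + \alpha \nabla _{\boldsymbol {\theta}} J(\boldsymbol {\theta}), \tag{7}\]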
with a learning rate α, i.e., we perform gradient ascent on the parameters to maximize the expected return. The policy gradient theorem [37] then states that the gradient of our performance measure can be written as
where \(\mu (s)\) is the on-policy distribution under the current policy, which depends on the time spent in each state, and \(S_{t}\) in the third line of Equation (8) are states sampled under the policy π. Using this, we can now derive the REINFORCE algorithm, which is the basis of policy-gradient-based training.
Our goal is to perform gradient ascent on the parametrized policy purely from samples generated from said policy through interactions with the environment. The last line of Equation (8) still contains a sum over all actions a, which we can replace by the sample \(A_{t} \sim \pi \) after multiplying and dividing the terms in the sum by \(\pi _{\boldsymbol {\theta}}(a|S_{t})\),
where \(G_{t}\) is the expected return from Equation (1). Now, by using the fact that \(\nabla \log x = \frac{\nabla x}{x}\), we can write
This equation allows us to estimate the gradient of \(J(\boldsymbol {\theta})\) by samples from the current policy \(\pi _{\boldsymbol {\theta}}\), and leads us to the following parameter update in each iteration of the algorithm,
where α is again the learning rate, \(R_{k}\) is the reward, and T is the length of the episode. Quantum versions of policy gradient based learning have been introduced in [28, 32], where the policy is implemented in form of a PQC.
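To make the REINFORCE update concrete, the following sketch performs one gradient-ascent step from a single sampled episode. It is only an illustration under stated assumptions: the policy is a classical linear-softmax stand-in, \(\pi _{\boldsymbol {\theta}}(a|s) \propto \exp ((\boldsymbol {\theta} s)_{a})\), rather than a PQC, the function names are hypothetical, and the per-step factor \(\gamma ^{t}\) that appears in some textbook formulations is omitted, as is common in practice.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def reinforce_update(theta, states, actions, rewards, alpha, gamma):
    """One REINFORCE step for the linear-softmax policy pi(a|s) = softmax(theta @ s).

    theta has shape (n_actions, n_features); the episode is given as
    parallel sequences of states, chosen actions and received rewards.
    """
    grad = np.zeros_like(theta)
    g = 0.0
    # Walk the episode backwards to accumulate the discounted return-to-go.
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        s = np.asarray(states[t], dtype=float)
        probs = softmax(theta @ s)
        onehot = np.zeros_like(probs)
        onehot[actions[t]] = 1.0
        # Gradient of log pi(a_t|s_t) with respect to theta for this policy.
        grad += g * np.outer(onehot - probs, s)
    # Gradient ascent on the expected return.
    return theta + alpha * grad


theta = np.zeros((2, 2))
theta = reinforce_update(
    theta, states=[[1.0, 0.0]], actions=[0], rewards=[1.0], alpha=0.1, gamma=1.0
)
print(theta)
```

For a PQC policy, the only part that changes is how \(\nabla \log \pi _{\boldsymbol {\theta}}\) is obtained, for example from circuit gradients instead of the closed-form expression used here.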
3 Environments and implementation
Our goal is to get insight into the effect of noisy training on quantum RL algorithms. For this, we consider quantum versions of the two main paradigms in RL that have been introduced in previous sections: value-based methods (see Sect. 2.1) and policy gradient methods (see Sect. 2.2). As we are interested in the effect of noisy training on models that have otherwise been proven to work well in the noise-free setting, we study models and environments that have already been investigated in this setting before [28, 29, 39]. In this way, we have evidence that the models and hyperparameters that we choose are suitable for the studied environments, and can focus our efforts on understanding the effect that noise has on the training and performance of these agents. The code that was used to generate the numerical results in this work can be found on GitHub [40].
3.1 CartPole
The first environment that we study is a benchmark task from the classical literature and implemented in the OpenAI Gym [41]: the CartPole environment. It has been previously studied in classical and quantum RL literature [27–29, 42]. In this environment, the goal is to learn to balance a pole that is attached to a cart that can move left and right on a frictionless track. The state s of the environment is represented by a four-dimensional input vector \(s\rightarrow \boldsymbol{x}=(x_{1}, x_{2}, x_{3}, x_{4}) \in \mathbb{R}^{4}\) encoding the position and velocity of the cart, and the velocity and angle of the pole. There are two actions that the agent can perform: moving the cart left or right. The environment is considered as solved when the agent manages to balance the pole for an average of at least 195 time steps for 100 consecutive episodes. We implement noisy training for the CartPole environment using the policy gradient approach introduced in [28] and the Q-learning approach introduced in [29].
The circuit used for Q-learning in [29] consists of five layers of a hardware-efficient ansatz [43], where each circuit layer consists of one parametrized rotation around the x-axis per qubit that is used to encode the input states x, and additional parametrized y- and z-rotations on each qubit that contain the free parameters to be trained (see Fig. 2(a)). Furthermore, additional trainable parameters multiplying each input feature are used to increase the expressivity of the re-uploading quantum circuit [44, 45]. Each layer also ends with a set of CZ-gates arranged in a circular topology. The observable for taking the action “left” is \(O_{L} = Z_{1} Z_{2}\), where \(Z_{1}\) and \(Z_{2}\) are Pauli-Z operators acting on the first and second qubit, respectively. Similarly, action “right” is associated to the observable \(O_{R} = Z_{3} Z_{4}\), defined on the third and fourth qubit. In order to facilitate the function approximation of the optimal Q-function, which has a range of output values beyond that of \(Z_{i} Z_{j}\) operators, each expectation value is further multiplied with an additional trainable weight, such that the final Q-value for action “left” is
\[Q(s, L) = \langle 0^{\otimes 4} | U_{\boldsymbol {\theta}}(s)^{\dagger} O_{L} U_{\boldsymbol {\theta}}(s) | 0^{\otimes 4} \rangle \cdot w_{L},\]
where \(U_{\boldsymbol {\theta}}(s)\) represents the unitary of the parameterised circuit depending on the trainable parameters θ and the input state s, and \(w_{L}\) is the trainable weight corresponding to observable \(O_{L}\). The Q-value for the action “right” is defined in a similar manner.
For the policy gradient method, we follow the implementation used in [28] and made available at [46], which uses five layers of the same hardware-efficient ansatz as described for Q-learning above, except that each layer has an additional trainable rotation around the x-axis on each qubit (see Fig. 2(b)), and the action observables are defined as \(O_{L} = Z_{1} Z_{2} Z_{3} Z_{4}\) and \(O_{R} = \mathbb{I} - O_{L}\). As before, input features are multiplied with an additional trainable parameter each. Since the policy is a probability distribution, a final SoftMax layer is used to map the expectation values \(\langle O_{a}\rangle_{s,\boldsymbol{\theta}} \in [-1,1]\) to the appropriate range \([0,1]\), and so probabilities for each action eventually become
\[\pi _{\boldsymbol {\theta}}(a|s) = \frac{e^{\beta \langle O_{a}\rangle_{s,\boldsymbol{\theta}}}}{\sum_{a'} e^{\beta \langle O_{a'}\rangle_{s,\boldsymbol{\theta}}}},\]
where \(\beta \in \mathbb{R}\) is also a trainable parameter.
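The SoftMax mapping from expectation values to action probabilities can be illustrated with a few lines of code. This is a sketch with a hypothetical function name; the expectation values are passed in as plain numbers, whereas in the actual agents they would be estimated from the PQC.

```python
import numpy as np


def softmax_policy(expectations, beta):
    """Map expectation values in [-1, 1] to action probabilities.

    beta plays the role of an inverse temperature: beta = 0 gives a uniform
    policy, while large beta concentrates the probability on the action
    with the largest expectation value.
    """
    z = beta * np.asarray(expectations, dtype=float)
    z = z - z.max()  # shift for numerical stability, leaves the result unchanged
    p = np.exp(z)
    return p / p.sum()


for beta in (0.0, 1.0, 50.0):
    print(beta, softmax_policy([0.2, -0.2], beta))
```

Since β is trained together with the circuit parameters, the agent can start with a nearly uniform policy and sharpen it as training progresses.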
3.2 Traveling salesperson problem
The second environment that we study is more complex and requires introducing the field of neural combinatorial optimization (NCO). NCO is an alternative to the handcrafted heuristics used in combinatorial optimization, where instead a machine learning model is trained to solve instances of a given combinatorial optimization problem [47]. In the case of RL-based NCO, the optimization problem is defined in terms of an environment and the quality of the solution is measured by the reward function. In this work, we study a quantum NCO approach that learns to solve instances of the Traveling Salesperson Problem (TSP), as introduced in [39]. In TSP one is presented with a list of cities in the form of a weighted graph, and the goal is to find the tour of minimal length that visits each city in this list exactly once.
In this environment one episode consists of solving one instance, where the agent selects the cities in the tour in a stepwise fashion. States in this environment are instances of the TSP, in addition to the partial tour at the current time step. The actions are defined in terms of the cities, where in each time step the agent can select one of the cities that is not yet in the tour. The reward is the negative difference in length between the tour at the previous time step and the tour after adding the latest city, as we want to minimize the length of the tour while RL agents try to maximize the expected reward. We evaluate the quality of the tours proposed by the agents in terms of the approximation ratio
\[\frac{c(T)}{c(T^{*})},\]
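where \(c(T)\) is the length of the tour T proposed by the agent, and \(c(T^{*})\) is the length of the optimal tour \(T^{*}\). Since the proposed tour can never be shorter than the optimal one, this ratio is bounded from below by one, and values closer to one are better. The stopping criterion for this environment is an average approximation ratio of at most 1.05 over the past 100 episodes.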
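For small instances the approximation ratio can be computed exactly by brute force, as in the following sketch. The function names are hypothetical, and two assumptions are made purely for illustration: cities are points in the plane with Euclidean distances as edge weights, and a tour is closed, i.e., it returns to its starting city.

```python
import math
from itertools import permutations


def tour_length(tour, coords):
    """Length of a closed tour given as a sequence of city indices."""
    n = len(tour)
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n)
    )


def optimal_tour_length(coords):
    """Brute-force optimum; only feasible for a handful of cities."""
    n = len(coords)
    # Fixing city 0 as the start removes rotations of the same tour.
    return min(tour_length((0,) + p, coords) for p in permutations(range(1, n)))


def approximation_ratio(tour, coords):
    """Ratio c(T) / c(T*), which is 1 for an optimal tour and larger otherwise."""
    return tour_length(tour, coords) / optimal_tour_length(coords)


# Four cities on the corners of a unit square.
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
print(approximation_ratio([0, 1, 2, 3], square))  # tour along the perimeter
print(approximation_ratio([0, 2, 1, 3], square))  # tour crossing both diagonals
```

For larger instances the optimal tour length would have to come from an exact or heuristic classical solver instead of enumeration.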
To implement a quantum agent for this environment, we follow [39], where the information of the TSP graph instance is directly encoded into a PQC and each graph node corresponds to one qubit. Each layer in this ansatz consists of one rotation around the x-axis parametrized by \(\alpha _{i} \beta _{l}\), where \(\alpha _{i} \in \{0, \pi \}\) represents whether city i is already in the current tour (\(\alpha _{i} = 0\)), or still available for selection (\(\alpha _{i} = \pi \)), and \(\beta _{l}\) is a trainable parameter that is shared across all single-qubit gates in layer l. The graph’s edges in each layer are represented by a ZZ-gate parametrized by \(\varepsilon _{ij} \gamma _{l}\), where \(\varepsilon _{ij}\) is the weight of the edge connecting nodes i and j, and \(\gamma _{l}\) is a trainable parameter that is shared across all two-qubit gates in layer l. This ansatz is shown in Fig. 2(c).
In the case of Q-learning, the observables are ZZ-operators that correspond to the edges in the graph, i.e., \(Z_{i} Z_{j}\) is measured for edge ij. For policy gradient agents the observables are the same, but as the policy has to be a probability distribution we again use a final SoftMax layer with a trainable inverse temperature β, as in Equation (14). The authors of [28] have shown that using this type of final layer can be highly beneficial for policy gradient training, compared to only using the probability distribution resulting from the quantum state directly. This is due to the fact that the trainable inverse temperature enables the agent to tune its level of exploration of the state space. As the optimal solutions to TSP instances are deterministic, it is favourable in this environment to have a tunable inverse temperature that allows exploration of the large state space early in training, as well as close-to-deterministic decisions towards the end.
4 Shot noise
We start our studies with the type of noise that is arguably the simplest to characterize: noise induced by statistical errors that result from the probabilistic nature of quantum measurements. For each circuit evaluation, be it for action selection of the RL agent or for computing parameter updates via the parameter shift rule, we take a fixed number of measurements M and compute the resulting expectation value. The precision of this expectation value depends on M and scales like \(\epsilon \sim 1/\sqrt{M}\).
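The \(1/\sqrt{M}\) scaling can be checked with a short classical simulation. The sketch below assumes an observable with outcomes ±1, as is the case for products of Pauli-Z operators, and a hypothetical exact expectation value of 0.3; all function names are chosen for this example only.

```python
import numpy as np


def estimate_expectation(exact, shots, rng):
    """Estimate the mean of a ±1-valued observable from a finite number of shots."""
    p_plus = (1.0 + exact) / 2.0  # probability of measuring +1
    n_plus = rng.binomial(shots, p_plus)
    return (2.0 * n_plus - shots) / shots


def empirical_std(exact, shots, repeats, rng):
    """Spread of the finite-shot estimate over many independent repetitions."""
    estimates = [estimate_expectation(exact, shots, rng) for _ in range(repeats)]
    return float(np.std(estimates))


rng = np.random.default_rng(seed=0)
std_100 = empirical_std(0.3, 100, 2000, rng)
std_10000 = empirical_std(0.3, 10_000, 2000, rng)
# Increasing M by a factor of 100 should shrink the error by roughly a factor of 10.
print(std_100, std_10000, std_100 / std_10000)
```

The same scaling means that halving the statistical error requires four times as many shots, which is why the number of measurements quickly becomes a bottleneck.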
Variational algorithms often require a very large number of measurements to be executed, and this problem is exacerbated in QML tasks that typically involve separate circuit evaluations for all training data points. For this reason, it is not only important to understand the effect of shot noise on the trainability and performance of QML models, but it is also desirable to develop methods that lead to a smaller shot footprint than simply assigning a fixed number of shots to each circuit evaluation. Depending on knowledge of the algorithm itself, it can be possible to make an informed decision on the number of shots that suffice in each step. In this section, we develop such a method specifically for Q-learning that is a natural extension to the original algorithm.
4.1 Reducing the number of shots in a Q-learning algorithm
As described in Sect. 2.1, a Q-learning agent selects actions based on the following rule (see Equation (4))
\[a_{t} = \operatorname{argmax}_{a} Q(s_{t}, a; \boldsymbol {\theta}),\]
that is, it chooses actions according to the largest Q-value.^{Footnote 1} Now, consider a quantum agent that only has access to noisy estimates of the Q-values \(\tilde{Q}(s_{t}, a_{t}; \boldsymbol {\theta})\) resulting from the statistical uncertainty of a measurement process involving a finite number of shots M. If the sample size is large enough (\(M\gg 1\)), then by the central limit theorem each noisy Q-value can be described as a random variable
\[\tilde{Q}(s_{t}, a_{t}; \boldsymbol {\theta}) = Q(s_{t}, a_{t}; \boldsymbol {\theta}) + \epsilon,\]
where \(Q(s_{t}, a_{t}; \boldsymbol {\theta})\) is the true noise-free value, and ϵ is a random variable sampled from a Gaussian distribution centered at zero, \(\mu _{\epsilon} = \mathbb{E}[\epsilon ]=0\), and with standard deviation inversely proportional to the square root of the number of measurement shots, \(\sigma _{\epsilon }= \operatorname{Std}[\epsilon ] \sim 1/\sqrt{M}\). Since actions are selected through an argmax function, the perturbation ϵ will not affect the action selection process as long as the order between the largest and the remaining Q-values remains unchanged. Then, one may ask: is there a minimal number of shots that suffices to reliably distinguish the largest Q-value \(Q_{max}\) from the second-largest Q-value \(Q_{2}\)?
When the observables associated to the actions are non-commuting, they have to be estimated independently from each other, and one has the freedom of choosing how to allocate the measurement shots among the observables of interest, possibly in a clever way. In our case, the goal is to estimate which of the observables has the highest Q-value while trying to be shot-frugal, and this task can be related to the theory of multi-armed bandits [48]. The multi-armed bandit is an RL problem in which an agent can allocate only a limited amount of resources between a number of choices, e.g., a number of arms on a bandit machine, and is asked to determine which of these choices leads to the highest expected reward. There exists a trade-off between exploration (i.e., trying the different arms) and exploitation (always choosing the arm that appears best according to the current knowledge), and the upper confidence bound (UCB) [49, 50] algorithm shows how to use statistical confidence bounds to allocate exploratory resources. The UCB algorithm could be used in the scenario described above where a number of non-commuting observables have to be estimated, and we want to find the optimal strategy to allocate a fixed budget of measurement shots to the task of identifying the largest Q-value.
However, in the specific implementations of QRL agents based on recent literature that we study in this work [28, 29, 39], only commuting observables are used, hence it is not necessary to apply the UCB procedure to determine which one should be measured more often. Nonetheless, inspired by the UCB algorithm, we can still define a simple and rather general heuristic that can be used to reduce the overall number of shots required to train Q-learning models such as those studied in this work. The idea is to use the knowledge about the scaling of the estimation error with respect to the number of measurements (see Equation (15)) to determine with confidence whether we have taken enough shots to identify the maximum Q-value.
The procedure goes as follows. First, we take a small number of initial measurements \(m_{\mathrm{init}}\), for example \(m_{\mathrm{init}} = 100\), of all observables to compute the estimates \(\tilde{Q}_{m_{\mathrm{init}}}(s_{t}, a)\), \(\forall a \in \mathcal{A}\). Based on these values, we compute the absolute difference between the largest and the second-largest Q-values. If this difference is larger than twice the estimation error, \(\epsilon = 2/\sqrt{m_{\mathrm{init}}}\) (as both of the Q-values are noisy), we have found the largest Q-value with high confidence and we stop here. On the other hand, if the difference is smaller, we increment the sample size with additional \(m_{\mathrm{inc}}\) measurements each, and recompute the estimated Q-values with the \(m_{\mathrm{inc}} + m_{\mathrm{init}}\) shots. We again compute the absolute difference of the two largest Q-values and determine whether the number of measurements suffices based on the error \(\epsilon = 2/\sqrt{m_{\mathrm{init}} + m_{\mathrm{inc}}}\). This measure-and-compare scheme is performed until either the two largest Q-values can be distinguished with high confidence, or a fixed shot budget \(m_{\mathrm{max}}\) is reached.
In Algorithm 1 we provide a description of this procedure, where for the sake of simplicity we describe the case where there are only two possible actions, and we therefore only have to find the larger of two Q-values. However, the scheme can be used for an arbitrary number of Q-values, as it is only important to distinguish between the highest and the second-highest Q-value with high confidence. The algorithm takes as input the number of initial measurements \(m_{\mathrm{init}}\), the number of additional measurements in every step \(m_{\mathrm{inc}}\), and the maximum number of measurements that are allowed in one run of the shot-allocation algorithm (i.e., finding the largest Q-value) \(m_{\mathrm{max}}\). The output is the number of measurements \(m_{\mathrm{est}}\) that are sufficient to find the argmax Q-value with high confidence based on the rules above. The values \(\langle O_{a_{i}} \rangle _{m_{\mathrm{est}}}\) are the expectation values of observables \(O_{a_{i}}\) corresponding to action \(a_{i}\), estimated with \(m_{\mathrm{est}}\) shots. Note that the proposed scheme works for both commuting and non-commuting observables, where in the former case one can save shots by computing the observables from the same set of measurement outcomes. Moreover, note that we ignore the coefficients in the statistics of the Q-values coming from Equation (12) when considering the measurement stopping criterion. We did not find this simplification to reduce the effectiveness of the proposed method, which performed well in the presented form in our experiments.
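The measure-and-compare procedure described above can be sketched as follows. This is a reading of the textual description rather than the authors' implementation. The function names and the interface are hypothetical: `sample_fn(m)` is assumed to return an array of shape `(m, n_actions)` with single-shot outcomes ±1 for each action observable, with at least two actions. As in the text, the trainable output weights are ignored in the stopping criterion.

```python
import numpy as np


def flexible_shot_argmax(sample_fn, m_init=100, m_inc=100, m_max=1000):
    """Take shots until the two largest estimates are separated or m_max is hit.

    Returns the index of the largest estimate, the estimates themselves and
    the number of shots m_est that were used.
    """
    outcomes = sample_fn(m_init)
    while True:
        m_est = outcomes.shape[0]
        estimates = outcomes.mean(axis=0)
        second, largest = np.sort(estimates)[-2:]
        gap = largest - second
        # Stop if the gap exceeds twice the estimation error 1/sqrt(m_est),
        # or if the shot budget is exhausted.
        if gap > 2.0 / np.sqrt(m_est) or m_est >= m_max:
            return int(np.argmax(estimates)), estimates, m_est
        extra = min(m_inc, m_max - m_est)
        outcomes = np.vstack([outcomes, sample_fn(extra)])


def make_sampler(expectations, rng):
    """Toy stand-in for a circuit: independent ±1 outcomes with given means."""
    p_plus = (1.0 + np.asarray(expectations, dtype=float)) / 2.0

    def sample_fn(m):
        return np.where(rng.random((m, len(p_plus))) < p_plus, 1.0, -1.0)

    return sample_fn


rng = np.random.default_rng(seed=0)
# Well-separated expectation values typically need few shots, close ones many.
print(flexible_shot_argmax(make_sampler([0.6, -0.2], rng))[2])
print(flexible_shot_argmax(make_sampler([0.05, 0.0], rng))[2])
```

The toy sampler draws the outcomes of the observables independently; for commuting observables on hardware, both columns would instead be computed from the same measured bit strings.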
While this algorithm can clearly determine the optimal number of shots in the action selection process in a methodical manner, one should check that this will not introduce errors in the remaining parts of the variational Q-learning model, i.e., during the parameter update step. Recall that each parameter update of the model is computed based on the output of the model itself (see Equation (5))
which means that in the parameter update step we do not need to perform action selection, but instead care about the actual Q-values in order to compute the loss function. The question is now to what precision we need to approximate the Q-values in order to learn a good Q-function. Technically, even the noise-free Q-function is only an approximation of the true Q-function, which is the whole point of doing Q-learning with function approximators. This suggests that there is some leeway to make even the approximate function itself an approximation by taking only as many measurements as are necessary to find the argmax Q-value with high confidence. Indeed, it has been shown in [29] that even the Q-functions of agents that successfully solve an environment can produce Q-values that are far from the optimal Q-values, and that learning the correct order of Q-values is more important in this setting than approximating the optimal Q-value as precisely as possible. Consequently, when we compute the Q-values that are used to perform parameter updates, we use the same procedure as in Algorithm 1 to determine the number of measurements to take.
4.2 Numerical results
We now numerically compare the performance of agents in the CartPole and TSP environments in settings where a fixed number of shots is used in each circuit evaluation, and where the number of shots in each step is determined by the algorithm we introduced in Sect. 4.1. To give an overview of the number of shots used in one training run under varying hyperparameter settings, we show the average cumulative number of shots for different settings in Fig. 3. For the CartPole environment (triangles), the number of cumulative shots grows quickly with the number of shots in each step in the fixed setting (orange). This is not true for the flexible shot allocation technique (blue), where for values of \(m_{\mathrm{max}} \in \{100, 1000, 10{,}000\}\) the cumulative number of shots is relatively similar. As we see in Fig. 4(a), a low number of shots such as 1000 is already sufficient to achieve close to optimal performance in the CartPole environment. Therefore, we focus on comparing settings with 100 and 1000 (maximum) shots per circuit evaluation in that figure. Comparing the cumulative number of shots for \(m_{\mathrm{fixed}}=100\) and \(m_{\mathrm{max}}=1000\) in Fig. 3, we see that these two configurations use almost the same number of measurements overall. Still, the final performance of the agents trained with the flexible shot allocation technique is almost optimal, while those trained with a fixed number of shots in each circuit evaluation are below a final score of 175 on average. However, as we allow agents to use even fewer than 100 shots per evaluation with the flexible allocation method of Algorithm 1, performance starts to degrade, so at least 100 shots are required in this setting. To avoid cluttering the figure, we show the results for agents that use fewer than 100 shots per circuit evaluation in Fig. 15 in the Appendix.
In the TSP environment, each step in an episode consists of a constant and (compared to CartPole) relatively low number of circuit evaluations. We still see that the higher the setting for the (maximum) number of shots is, the bigger the gap in average cumulative number of shots becomes. For agents trained in the TSP environment, shown in Fig. 4(b), the final performance remains unchanged by the additional noise introduced by the flexible shot allocation technique, and agents reach the same accuracy as those trained with a corresponding but fixed number of shots per circuit evaluation. The only difference between the two approaches is that the agents using the flexible shot allocation method take slightly longer to converge in some cases. Independently of the estimation method used (flexible or fixed), it is clear from Fig. 4 that the number of shots available plays the major role in determining the performance of the noisy agents, as measured by the proximity to the average approximation ratios reached in the noise-free scenario, namely when agents have access to the exact expectation values (\(M \rightarrow \infty \)). In this environment, there is a trade-off between delayed convergence due to less precision in the approximation of the Q-function, and using a higher number of shots to arrive at the same final performance.
To summarize, we have seen that Q-learning models can be successfully trained even in the presence of statistical noise introduced by a measurement process carried out with a limited number of shots. In addition, by leveraging the specifics of the Q-learning algorithm, we introduced an easy-to-implement and effective method that can be used to reduce the number of shots needed to train variational Q-learning agents. How many shots one can save during training with this method depends on the agents’ resilience to shot noise, as well as the specific characteristics of the environment. In the CartPole environment, where one bad decision does not lead to immediate failure, the additional noise introduced by estimating expectation values with a low number of measurements and approximating an imprecise Q-function does not affect performance severely. In the TSP environment on the other hand, where one bad choice of the next city in the tour can lead to a much longer path, we observe that the number of measurements has to be relatively high to get close to optimal performance. However, even in this setting we can achieve a reduction in the overall number of measurements by taking an informed approach to deciding when to measure an observable more often.
5 Coherent noise
In this section, we turn our attention to coherent noise, that is, errors that preserve the unitary evolution of the quantum circuit but still change its output [51]. In our analysis, we model coherent noise as an over- or under-rotation of the parametrized gates, by adding a random Gaussian perturbation to the variational parameters in the considered circuits.
This type of error could occur in real quantum devices as a drift in the parameters, for example due to imperfect control of the system or a miscalibration of the hardware, and it is therefore an important component of the overall picture of an imperfect quantum device. Specifically, we assume that the perturbation remains unchanged during the estimation of a given observable, i.e., it does not change considerably between repeated measurements on the same experimental setup. However, the perturbation changes whenever the experiment is changed, for example due to measuring a different observable, or using the circuit with a different set of parameters.
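The noise model described above can be illustrated on a toy one-parameter circuit. The sketch below is an assumption-laden stand-in for the circuits studied here: it uses a single qubit prepared as \(RX(\theta)|0\rangle\) and measured in Z, whose exact expectation value is \(\cos \theta\), and the function name `noisy_expectation` is hypothetical. The key point is that one perturbation is drawn per expectation-value estimate and reused for every shot of that estimate.

```python
import math
import random


def noisy_expectation(theta, sigma, shots, rng):
    """Estimate <Z> for RX(theta)|0> under Gaussian coherent noise.

    One perturbation is drawn per call and reused for every shot, so the
    circuit stays unitary within a single estimate.
    """
    delta = rng.gauss(0.0, sigma)  # over- or under-rotation for this experiment
    p_plus = (1.0 + math.cos(theta + delta)) / 2.0
    total = sum(1 if rng.random() < p_plus else -1 for _ in range(shots))
    return total / shots


rng = random.Random(42)
# Each call models a new experiment, hence a freshly drawn perturbation.
estimates = [noisy_expectation(0.5, 0.1, 1000, rng) for _ in range(5)]
```

The spread of `estimates` then combines shot noise within each call with the coherent perturbation that varies between calls.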
Gaussian coherent noise is also an interesting model because it lends itself very well to theoretical analysis, and one can estimate the effect of such an error on the output of a parameterised quantum circuit. In the following, we first proceed with an analytical treatment of the error introduced by Gaussian perturbations on variational circuits, and then proceed with the numerical results for the two environments considered in this work.
5.1 Effect of Gaussian coherent noise on circuit output
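Consider a general parametrized quantum circuit acting on a system of n qubits, with unitary \(U(\boldsymbol{\theta}) \in \mathbb{C}^{2^{n} \times 2^{n}}\) and parameter vector \(\boldsymbol{\theta} = (\theta _{1}, \ldots , \theta _{M}) \in \mathbb{R}^{M}\). Let O be an observable and \(\rho = |0\rangle \langle 0|\) the initial state of the quantum system. The outcome of the variational circuit is the expectation value
\[ f(\boldsymbol{\theta}) = \operatorname{Tr} \bigl[ O\, U(\boldsymbol{\theta})\, \rho\, U^{\dagger}(\boldsymbol{\theta}) \bigr]. \tag{16} \]
Suppose that the parameters are affected by a noise process that adds a perturbation
\[ \boldsymbol{\theta} \rightarrow \boldsymbol{\theta} + \delta \boldsymbol{\theta}, \tag{17} \]
where the components of \(\delta \boldsymbol{\theta} = (\delta \theta _{1},\ldots , \delta \theta _{M}) \in \mathbb{R}^{M}\) are i.i.d. according to a Gaussian distribution \(\mathcal{N}(\mu , \sigma ^{2})\) with zero mean \(\mu =0\) and equal variance \(\sigma ^{2}\), namely
\[ \mathbb{E}[\delta \theta _{i}] = 0, \qquad \mathbb{E}[\delta \theta _{i}\, \delta \theta _{j}] = \sigma ^{2} \delta _{ij}. \tag{18} \]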
As discussed earlier, in our analysis in this section and in the numerical simulations in Sect. 5.3.1, we assume that the perturbed parameters remain the same during the evaluation of a single expectation value. In a real experiment on quantum hardware, this would mean that for all measurements used to estimate the expectation value, the perturbations stay at least approximately unchanged. Of course, without this assumption, the resulting noise model could not be considered unitary, and one may then resort to a noise channel formulation of Gaussian noise as proposed in [4, 35]. Hence, in the following we restrict our attention to the setting described above.
The effect of Gaussian noise on the circuit can be evaluated by Taylor expanding the circuit around the unperturbed parameters θ. For ease of explanation, we hereby report only the main ideas and results, and we refer to Appendix B for a complete and detailed derivation of all the calculations performed in this section.
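Let \(f(\boldsymbol{\theta} + \delta \boldsymbol{\theta})\) be the function evaluated on the perturbed parameters. Its Taylor expansion up to fourth order reads
\[ f(\boldsymbol{\theta} + \delta \boldsymbol{\theta}) = f(\boldsymbol{\theta}) + \sum_{i} \frac{\partial f(\boldsymbol{\theta})}{\partial \theta _{i}} \delta \theta _{i} + \frac{1}{2} \sum_{i,j} \frac{\partial ^{2} f(\boldsymbol{\theta})}{\partial \theta _{i} \partial \theta _{j}} \delta \theta _{i} \delta \theta _{j} + \frac{1}{3!} \sum_{i,j,k} \frac{\partial ^{3} f(\boldsymbol{\theta})}{\partial \theta _{i} \partial \theta _{j} \partial \theta _{k}} \delta \theta _{i} \delta \theta _{j} \delta \theta _{k} + \mathcal{O}\bigl(\delta \theta ^{4}\bigr). \tag{19} \]
With this expression one can evaluate the expected value of the noisy function \(\mathbb{E}[f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})]\) over the distribution of the Gaussian perturbations, \(\mathbb{E}(\cdot ) = \mathbb{E}_{\delta \theta _{i} \sim \mathcal{N}(0, \sigma ^{2})}(\cdot )\). Since every odd moment of a Gaussian distribution vanishes, using relations (18) in the expansion (19) one obtains
\[ \mathbb{E}[f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})] = f(\boldsymbol{\theta}) + \frac{\sigma ^{2}}{2} \operatorname{Tr} [H(\boldsymbol{\theta})] + \mathcal{O}\bigl(\sigma ^{4}\bigr), \tag{20} \]
where \(\operatorname{Tr} [H(\boldsymbol{\theta})]\) denotes the trace of the Hessian matrix
\[ H_{ij}(\boldsymbol{\theta}) = \frac{\partial ^{2} f(\boldsymbol{\theta})}{\partial \theta _{i} \partial \theta _{j}}. \tag{21} \]
Thus, the first non-vanishing correction term caused by the noise is proportional to the noise variance \(\sigma ^{2}\) and to the trace of the Hessian of the parametrized quantum circuit, which conveys geometric information about the curvature of the function landscape around the unperturbed point θ.
Higher-order terms in the expansion can be evaluated in a similar way, specifically making use of so-called Wick’s relations for multivariate normal distributions as shown in Appendix B. If all the derivatives of the function \(f(\boldsymbol{\theta})\) are bounded, as is the case for parametrized quantum circuits, then it is possible to derive an upper bound on the error induced by the perturbations which only depends on the noise strength \(\sigma ^{2}\) and the total number of parameters M, as we show in the following.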
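The second-order approximation can be checked numerically on a function whose noisy average is known in closed form. The sketch below uses a toy product function \(f(\boldsymbol{\theta}) = \prod_{i} \cos \theta _{i}\), chosen here purely for illustration and not the ansatz used in this work; for this choice every diagonal Hessian entry equals \(-f(\boldsymbol{\theta})\), and the exact Gaussian average is \(f(\boldsymbol{\theta})\, e^{-M\sigma ^{2}/2}\). All function names are hypothetical.

```python
import math
import random


def f(theta):
    """Toy circuit output: product of cos(theta_i) over all parameters."""
    return math.prod(math.cos(t) for t in theta)


def hessian_trace(theta):
    """Trace of the Hessian of f; each diagonal entry equals -f(theta)."""
    return -len(theta) * f(theta)


def noisy_mean(theta, sigma, samples, seed=0):
    """Monte Carlo average of f over Gaussian perturbations of all parameters."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += f([t + rng.gauss(0.0, sigma) for t in theta])
    return total / samples


theta = [0.3, -0.2, 0.5, 0.1]
sigma = 0.1
# Second-order approximation: f + (sigma^2 / 2) * Tr[H].
approx = f(theta) + 0.5 * sigma**2 * hessian_trace(theta)
# Closed form valid for this particular toy function only.
exact = f(theta) * math.exp(-len(theta) * sigma**2 / 2)
estimate = noisy_mean(theta, sigma, samples=20000)
```

For these values the Hessian-based approximation and the closed form differ only at order \(\sigma ^{4}\), and the sampled average agrees with both up to Monte Carlo error.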
Using the parameter shift rule [52, 53], one can show that any derivative of a parametrized quantum circuit can be expressed as a linear combination of circuit outcomes evaluated at specific points in parameter space [35, 36]. Let \(\boldsymbol{\alpha} = (\alpha _{1}, \ldots , \alpha _{M}) \in \mathbb{N}^{M}\) be a multi-index keeping track of the order of partial derivatives, and define the derivative operator
\[ \partial ^{\boldsymbol{\alpha}} := \frac{\partial ^{|\boldsymbol{\alpha}|}}{\partial \theta _{1}^{\alpha _{1}} \cdots \partial \theta _{M}^{\alpha _{M}}}, \tag{22} \]
where \(|\boldsymbol{\alpha}| := \sum_{i=1}^{M} \alpha _{i}\). By nested applications of the parameter shift rule, one can show that
\[ \partial ^{\boldsymbol{\alpha}} f(\boldsymbol{\theta}) = \frac{1}{2^{|\boldsymbol{\alpha}|}} \sum_{m=1}^{2^{|\boldsymbol{\alpha}|}} s_{m} f(\boldsymbol{\theta}_{m}), \tag{23} \]
where \(s_{m} \in \{\pm 1\}\) are signs, and \(\boldsymbol{\theta}_{m}\) are parameters obtained by shifting the parameter vector θ along different directions. Now, since the measurement outcome of every circuit is bounded by the maximum absolute eigenvalue of the observable, i.e. \(|f(\boldsymbol{\theta})| \leq \|O\|_{\infty}\), it also holds that \(|\partial ^{\boldsymbol{\alpha}} f(\boldsymbol{\theta})| \leq \|O\|_{\infty}\) (see Appendix B). Note that we only consider bounded observables here, like the Pauli operators commonly used in variational RL algorithms [26–29].
Since all the derivatives of the function are bounded, it is possible to bound every term in the Taylor series and then compute an upper bound to the error caused by the perturbation. In fact, defining the absolute (average) error caused by the noise as
\[ \varepsilon _{\boldsymbol{\theta}} := \bigl| \mathbb{E}[f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})] - f(\boldsymbol{\theta}) \bigr|, \tag{24} \]
one can prove that this is upper bounded by (see Appendix B)
\[ \varepsilon _{\boldsymbol{\theta}} \leq \bigl( e^{\sigma ^{2} M/2} - 1 \bigr) \|O\|_{\infty}. \tag{25} \]
Note that since \(\varepsilon _{\boldsymbol{\theta}}\leq 2\|O\|_{\infty}\) is always true, the bound is informative only as long as \(e^{\sigma ^{2} M /2 } - 1<2\).
This expression only depends on the noise strength \(\sigma ^{2}\), the total number of noisy parameters M, and the operator norm of the observable \(\|O\|_{\infty}\), and it can be used to estimate a sufficient condition on the noise strength to guarantee a desired error threshold \(\varepsilon _{\boldsymbol{\theta}}\). Rearranging Equation (25), a sufficient condition to have error \(\varepsilon _{\boldsymbol{\theta}}\) not larger than ϵ is to have Gaussian perturbations satisfying
\[ \sigma ^{2} \leq \frac{2}{M} \log \biggl( 1 + \frac{\epsilon}{\|O\|_{\infty}} \biggr). \tag{26} \]
As the allowable error is small \(\epsilon \ll 1\), by approximating the logarithm \(\log (1+x) \approx x\), one derives that the perturbations must follow the scaling
\[ \sigma ^{2} \lesssim \frac{2 \epsilon}{M \|O\|_{\infty}}. \tag{27} \]
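The statement that derivatives of any order are signed averages of circuit outputs can be checked on a one-parameter example. The sketch below assumes the standard two-term shift rule with shifts of \(\pm \pi /2\), which holds for gates generated by Pauli operators, and uses \(f(\theta ) = \cos \theta \) (the Z expectation value of \(RX(\theta )|0\rangle \)) as a stand-in for a circuit; `shift_derivative` is a hypothetical helper.

```python
import math


def f(theta):
    """<Z> for RX(theta)|0>, i.e. cos(theta)."""
    return math.cos(theta)


def shift_derivative(func, theta, order):
    """Compute the order-th derivative via nested parameter-shift rules.

    The result is an average of 2**order signed evaluations of func, so its
    magnitude can never exceed the largest value of |func|.
    """
    if order == 0:
        return func(theta)
    plus = shift_derivative(func, theta + math.pi / 2, order - 1)
    minus = shift_derivative(func, theta - math.pi / 2, order - 1)
    return 0.5 * (plus - minus)
```

For example, `shift_derivative(f, 0.7, 1)` reproduces \(-\sin 0.7\) and `shift_derivative(f, 0.7, 2)` reproduces \(-\cos 0.7\), and both stay within the bound of 1 set by the observable.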
Note that a similar scaling law was recently derived also in [35], though via a slightly different method based on the moment generating function of the probability distribution characterising the perturbations.
To provide an example, assume one is willing to tolerate an error of \(\epsilon = 10\%\), that \(\|O\|_{\infty} = 1\) as for measuring a Pauli operator and that the PQC consists of \(M=100\) noisy parametrized gates, then one can be sure of such accuracy if \(\sigma \sim 0.1 / \sqrt{100} = 0.01\). However, we stress again that the scaling Equation (26) is only a sufficient but not necessary condition for achieving an error ϵ. In fact, apart from the requirement of bounded derivatives, Equation (26) is agnostic with respect to the specifics of the function, and the bound can be quite loose in real instances where a much larger noise level still causes a small error, as shown in Fig. 5.
In Fig. 5, we report results obtained by simulating the parametrized ansatz depicted in Fig. 2(b) subject to Gaussian coherent noise of increasing strength. It is clear that the output of the circuit closely follows the approximation of Equation (20) given by the Hessian even at moderately large values of the noise \(\sigma \lessapprox 0.15\). When the noise is too strong (\(\sigma > 0.2\)), the circuit becomes essentially random, and the average expectation value when measuring a Pauli operator is zero. This is a consequence of PQCs often behaving like unitary designs upon random initialization of the parameters [54, 55], a fact which we discuss in detail in Sect. 5.2. Finally, as discussed earlier, while the upper bound (25) holds, it is indeed very loose and is tight only at small \(\sigma \lessapprox 0.01\).
We now proceed to discuss why hardware-efficient parametrized quantum circuits can be resilient to Gaussian coherent noise. Roughly, this is because such circuits are found to behave like random unitaries upon random assignment of the parameters, which implies that the derivatives of such circuits tend to vanish as the system size grows large [36].
5.2 Resilience of hardware-efficient ansatzes to Gaussian coherent noise
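The numbers in this example can be reproduced with a few lines of Python. The sketch assumes that the upper bound has the form \(\varepsilon _{\boldsymbol{\theta}} \leq (e^{\sigma ^{2} M/2} - 1)\|O\|_{\infty}\), which is the form implied by the informativeness condition stated above; the function names `error_bound` and `max_sigma` are hypothetical.

```python
import math


def error_bound(sigma, num_params, op_norm=1.0):
    """Assumed upper bound on the average error: (exp(sigma^2 M / 2) - 1) * ||O||."""
    return (math.exp(sigma**2 * num_params / 2) - 1.0) * op_norm


def max_sigma(eps, num_params, op_norm=1.0):
    """Largest sigma for which the assumed bound stays at or below eps."""
    return math.sqrt(2.0 / num_params * math.log(1.0 + eps / op_norm))


bound_at_example = error_bound(0.01, 100)   # bound for sigma = 0.01, M = 100
largest_sigma = max_sigma(0.1, 100)         # tolerable sigma for eps = 0.1
```

Under this assumption, \(\sigma = 0.01\) yields a bound of roughly 0.005, comfortably below the 10% target, and the bound itself would only reach 10% at \(\sigma \approx 0.044\).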
The previous analysis showed that Gaussian perturbations induce an error depending on the Hessian of the circuit (see Equation (20)), so that up to fourth order in the perturbation it holds that
\[ \mathbb{E}[f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})] - f(\boldsymbol{\theta}) = \frac{\sigma ^{2}}{2} \operatorname{Tr} [H(\boldsymbol{\theta})] + \mathcal{O}\bigl(\sigma ^{4}\bigr). \tag{28} \]
This equation tells us that if the optimization landscape is flat or close to being flat, then the Hessian is small, and so the perturbation will have little effect on the output of the circuit. On the contrary, in the presence of a very curved landscape, noise will have a great impact and the output of the circuit may change appreciably. It is known that the curvature of the optimization landscape produced by a PQC is closely related to the barren plateau phenomenon [54–56], where the variance of the first and second derivatives vanishes exponentially in the number of qubits and layers in a random circuit. Additionally, the hardware-efficient ansatz we use for some of the environments in this work is known to suffer from barren plateaus when the system size is large. As the optimization landscape of these types of circuits is very flat, it can also be expected that the type of noise induced by the Gaussian perturbations on parameters that we study in this work should not affect circuits that generally produce small first- and second-order derivatives. While circuits that are in the barren plateau regime are obviously undesirable as they quickly become untrainable, one can consider circuits of a size such that the variance in gradients is relatively small, but the circuit has not yet converged to an approximate 2-design, as shown in [54]. We make this statement more formal in the following.
We can use standard results on averages of unitary designs [57, 58] to characterize the Hessian of hardware-efficient circuits, and thus gain insight on their performance under Gaussian noise. We report the main results of our analysis here; full derivations can be found in Appendix B.2. In the following, we suppose that sampling a random value of the parameter vector θ in the parametrized circuit \(U(\boldsymbol{\theta})\) is equivalent to sampling a unitary from a unitary 2-design, defined as a set of unitary matrices that match the Haar distribution up to the second moment. Also, we consider observables O being Pauli strings, so that \(\operatorname{Tr} [O] = 0\) and \(\operatorname{Tr} [O^{2}] = 2^{n}\). In order to distinguish from the previous notation where averages were computed over the Gaussian distribution of the perturbations, we use \(\mathbb{E}_{U}[\cdot ]\) and \(\operatorname{Var}_{U}[\cdot ]\) to denote average values and variances evaluated over the random unitaries.
Then, under reasonable and usual assumptions on parts of the parametrized quantum circuit being 2-designs, it is possible to show that the diagonal elements of the Hessian \(H_{ii} = \partial ^{2} f(\boldsymbol{\theta})/\partial \theta _{i}^{2}\) satisfy [36] (see also Appendix B.2 for an explicit derivation)
That is, in addition to first order derivatives, also second order derivatives of random parameterized quantum circuits are found to be zero on average, and with a variance which is exponentially vanishing.
Starting from the results above, one can calculate the statistics of the trace of the Hessian, for which it holds
Furthermore, our numerical simulations suggest that the variance of the trace of the Hessian is actually smaller, and is well captured by the following expression
a fact which we justify and discuss in Appendix B.2.2.
In Fig. 6 we report simulation results of evaluating the trace of the Hessian matrix for the circuit shown in Fig. 2(b). The histogram represents the frequency of obtaining a given value of the trace of the Hessian \(\operatorname{Tr} [H(\boldsymbol{\theta})]\) upon random assignments of the parameters. Indeed, there is a very good agreement between the variance obtained via numerical simulations (black solid line), and the one calculated with the approximation (31) (dashed red line).
The circuit used has \(M=92\) parameters and \(n=4\) qubits, and plugging these values into Equation (31) yields a standard deviation \(\sigma _{U} = \operatorname{Std}_{U} [\operatorname{Tr} [H]] \approx 11\). Then, if the behaviour of the PQCs in practical scenarios is well described by its random parameter regime, one expects the trace of the Hessian to be on average zero and in general not much bigger (in absolute value) than \(\sigma _{U} \approx 11\). With this order of magnitude for the trace, the first-order correction of Equation (28), even with a Gaussian noise level of \(\sigma = 0.1\), is very small, as it amounts to
\[ \frac{\sigma ^{2}}{2} \bigl| \operatorname{Tr} [H(\boldsymbol{\theta})] \bigr| \approx \frac{0.1^{2}}{2} \times 11 \approx 0.055. \tag{32} \]
Summing up, for those PQCs whose cost landscape is close to being flat, Gaussian perturbations on the variational parameters will have a limited impact on the output of the quantum circuit.
5.3 Numerical results
5.3.1 CartPole
First, we evaluate the performance of policy gradient and Q-learning algorithms when Gaussian perturbations are applied at each circuit evaluation during training. In Fig. 7(a) and (b), we show the training and evaluation performance, respectively, of Q-learning agents in the CartPole environment with perturbations in the range \(\sigma \in \{0, 0.1, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2\}\). Only the agent trained with noise level \(\sigma =0.1\) learns the environment successfully and remains close to optimal performance. As suggested by our theoretical analysis in Sect. 5.1, performance starts to degrade as we consider higher perturbations of \(\sigma > 0.1\), and none of those agents manage to achieve a better performance than a score of 125 on average. In Fig. 7(b) we evaluate the performance of trained agents when they act in an environment with different perturbation levels than those present when they were trained. Even agents that do not perform well during training achieve close to optimal performance when evaluated in the noise-free setting. This suggests that despite their bad training performance due to the added perturbations, these agents still learn a good Q-function. Notably, the agents trained without noise perform worst when they are evaluated under various levels of perturbations.
Results for agents trained with the policy gradient method are shown in Fig. 8(a). While again only the agents trained with a perturbation of \(\sigma =0.1\) perform well and even reach optimal performance, agents with higher perturbations also largely stay close to optimal performance with a final score of 125 on average. Even the agent trained with a relatively high \(\sigma =0.2\) is robust in this setting, even though it requires by far the most training episodes to get to a good score. This positive trend is also visible in Fig. 8(b), where we see that all agents achieve close to optimal performance when evaluated with perturbation levels \(\sigma \leq 0.1\), which is again in line with our theoretical analysis in Sect. 5.1. The difference between agents trained with Gaussian perturbations and those trained without is not as large as in the Q-learning setting, and at evaluation time both algorithms perform similarly. Another observation about the policy gradient agents is that those trained with \(\sigma =0.2\) achieve optimal or close to optimal performance in the environment under various perturbation levels at evaluation time, and are the most robust out of all agents trained in this setting. Overall, the policy gradient method shows a larger resilience to Gaussian noise in our experiments for the CartPole environment. It is an open question why this is the case; however, we did not observe better performance of the policy gradient algorithm under noise in general, as results in later sections will show.
In addition to studying the performance of Q-learning and policy gradient agents at training and evaluation time, we visualize the learned policies and Q-functions of both in the noisy and noise-free setting in Fig. 9. As learned policies and Q-functions can look different even when training the same agent twice, we show averages of the ten agents shown in Fig. 7 and Fig. 8 for both algorithms, and for perturbation levels of \(\sigma =0\) (blue) and \(\sigma =0.2\) (yellow), respectively. The CartPole environment has four inputs: cart position and velocity, and pole angle and velocity. To visualize the learned policies and Q-functions, we show the probabilities and Q-values for taking the action “right” as a function of pairs of state values. The state inputs that are not in the figure are set to zero, and for the sake of clarity we do not apply perturbations to the parameters when visualizing the policy. In Fig. 9(a)–(c), we see results for policy gradient agents. Overall, it can be seen that the agents trained without perturbations learn smoother policies, so that for most states there is a clear decision on which action to take. Training with perturbations makes the policies slightly more rippled, but they still mostly follow the contours of the policy learned under ideal conditions.
The approximated Q-functions can be seen in Fig. 9(d)–(f). One observation we make here is that the range that Q-values take blows up considerably compared to the noise-free setting. This is due to the trainable output weights that the expectation values are multiplied with in the Q-learning setting (see Sect. 3) becoming considerably larger for agents trained in the noisy setting. However, as we can see in the Appendix in Fig. 17, the shapes of the learned Q-functions of the noise-free and noisy agents are still very similar, which explains why even the agents trained with \(\sigma = 0.2\) perform almost optimally when evaluated without perturbations in Fig. 7(b). We also note that the range of Q-values of both the noisy and noise-free agents is much larger than the range of optimal Q-values given in [29]. This can be understood as the agent consistently overestimating the expected return, a problem known to arise in classical Q-learning, and which is exacerbated by noise [59]. However, the authors of [29] also point out that in the function approximation setting, it is more important to learn the order of Q-values for each state (i.e., preserving that the argmax Q-value corresponds to the optimal action) than learning a close representation of the optimal Q-values.
5.3.2 TSP
In this section, we study the performance of Q-learning and policy gradient algorithms with Gaussian coherent noise in the TSP environment. Panels (a) and (b) in Fig. 10 show the training and evaluation performance of Q-learning agents in this environment under perturbations in the range \(\sigma \in \{0, 0.1, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2\}\). We note that the Q-learning agents trained without noise already converge after 600 episodes on average, but to get an equal runtime in terms of episodes for all settings, we also let them run for 10,000 episodes. This unnecessarily long runtime causes the optimizer to leave the local minimum again, which we ignore as an artifact here and consider the lowest average approximation ratio for the comparison with the other models.
For the TSP environment, we observe that with increasing levels of Gaussian perturbations, convergence of agents is delayed and their final approximation ratio becomes worse compared to the noise-free agents’ performance. Still, all agents seem to learn very similar policies despite being trained with different settings of σ, as we can see by their almost identical performance at evaluation time shown in Fig. 10(b). Despite a drop in performance during training, the final performance of the models on a test set of previously unseen TSP instances stays almost unaffected by the noise present during training. While we see that agents trained with more noise seem to learn more noise-robust policies as in the case of the CartPole environment, this effect is not as pronounced here. Additionally, we again see that performance of trained models in Fig. 10(b) starts to drop at \(\sigma > 0.1\), as indicated by our theoretical analysis in Sect. 5.1. While the policy gradient method shows a certain robustness to noise during training in the CartPole environment, this is not the case for the TSP environment, as we show in Fig. 11(a). The only agent that gets close in performance to the noise-free agent is the one trained with \(\sigma =0.1\), while higher perturbations yield agents that are relatively bad with an approximation ratio between 1.4 and 1.6 on average. However, again, all agents seem to learn similar policies as indicated by their test performance in Fig. 11(b). Similar to CartPole, the agents’ performance on the test set under varying perturbation levels closely matches that of the noise-free agents, and again we see a large drop in performance for perturbations that are higher than \(\sigma = 0.1\).
Overall, the Q-learning algorithm performs better in the TSP environment than the policy gradient method. The optimal tour for each TSP instance is deterministic, so using a stochastic policy as in the policy gradient approach introduces an additional source of error, as there is always a non-zero probability to choose a non-optimal action. This leads to an increased susceptibility to the Gaussian perturbations present during the evaluation of the policy gradient algorithm. This is not the case for Q-learning, where choices are made based on the argmax Q-value. Additionally, the ansatz that we use does not separate data encoding from trainable parameters, as described in Sect. 3. As the optimal tour of a TSP instance does not change upon small perturbations of the edge weights, this leads to a relative robustness of this ansatz used in conjunction with Q-learning to Gaussian coherent noise in this environment.
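The difference between the two action-selection rules can be made concrete with a small sketch. It assumes a softmax policy over action preferences, a common choice for policy gradient methods that the discussion above does not specify, and the numerical values are arbitrary illustrations; `softmax_policy` and `greedy_action` are hypothetical names.

```python
import math


def softmax_policy(preferences, beta=1.0):
    """Action probabilities of a softmax (stochastic) policy."""
    top = max(preferences)
    weights = [math.exp(beta * (x - top)) for x in preferences]
    total = sum(weights)
    return [w / total for w in weights]


def greedy_action(q_values):
    """Deterministic argmax choice, as used for action selection in Q-learning."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


values = [2.0, 1.0, 0.5]          # arbitrary illustrative scores per action
probs = softmax_policy(values)
# Probability that the stochastic policy picks anything but the best action.
p_suboptimal = 1.0 - probs[greedy_action(values)]
```

With these values the stochastic policy selects a non-optimal action roughly 37% of the time, whereas the greedy rule always returns the top-ranked action as long as the ordering of the values is preserved.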
6 Incoherent noise
The Gaussian perturbation noise that we studied in Sect. 5 is well-suited to model coherent errors due to imprecision in the control of the quantum device, but it does not reflect noise that results from undesired interactions of the quantum system with its environment. To study the effect of this type of incoherent noise we perform additional experiments in this section.
We simulate this type of noise with TensorFlow Quantum (TFQ) [60], where noise channels are implemented through a Monte Carlo trajectory sampling method [61, 62] that approximates the effect of noise by averaging over state vectors generated from a probabilistic application of the noise channel. This method of simulating noise essentially trades off the overhead in memory needed to store the \(2^{n} \times 2^{n}\) sized density matrices necessary to simulate incoherent noise, with a runtime overhead. The precision of this approximation is determined by the number of repetitions, which specifies how many “noisy” state vectors are used. This adds a stochastic element to the simulation of the noise channels, and we get closer to simulating the exact noise model as the number of trajectories increases. Depending on the environments, we choose the number of trajectories so that it is possible to perform simulations in a reasonable time frame, and specify this number individually for each of the experiments below. We note that the runtime requirements for CartPole when simulating this type of noise are especially high, as the number of time steps in each episode, as well as the number of episodes itself, depend strongly on the performance of the agent. In particular, agents that perform neither very well nor very badly, which are exactly the noise configurations we are interested in studying here, take especially long to simulate, as they do not converge early by solving the environment, but still take on the order of 100 time steps in each episode. Therefore we focus our attention mainly on the TSP environment in this section.
6.1 Depolarizing noise
Depolarization noise affects a quantum state by either replacing it with the completely mixed state with probability p, or leaving it untouched otherwise [63]. Let ρ be the density matrix of a qubit, then depolarizing noise is defined by the map
We model depolarization noise with Cirq [61] and TFQ by appending a layer of local depolarizing channels to every qubit after each time step of the computation, where a time step is defined as the largest set of gates that can be implemented simultaneously. This implementation takes into account the possibility of crosstalk between qubits [64]. Also, note that while the use of depolarizing channels alone may not be a good approximation of real single qubits errors, it may become a good effective description of the overall noise process for the case where many qubits and layers are used [65].
In our simulations, we assume that both single and twoqubits gates are noisy, and consist of a composition of the ideal gates followed by local depolarizing channels of equal probability p, acting independently on each qubit. In particular, the application of a depolarizing noise channel is implemented by performing one out of four actions at each circuit execution (trajectory): do nothing with probability \(1p\), or apply at random one of the three Pauli operators with probability p, and then average over the results. We remark that the average gate error of singlequbit gates in currently available superconducting quantum computing hardware is of the order of \(r\lessapprox 0.01\), with gate fidelities exceeding \(>99\%\). Finally, we note that one can relate the depolarisation strength p to the average gate error r over single qubit Cliffords, as measured by Randomized Benchmarking (RB) [64, 66, 67] and commonly reported for quantum devices [68, 69], via \(r = p/2\). However, our circuits do not only use Cliffords, and moreover the RB’s estimates for the gate error depend on the basis gates available on the device. Therefore, one should consider our simulations with depolarizing noise of strength p as a proxy for a quantum device whose average error rate r is of the same order of magnitude of p. While a singlequbit error noise model may not be accurate enough to closely mimic the behaviour of a real quantum device, it gives us the possibility to study the effect of singlequbit errors separately, before we go on to study a noise model that also includes twoqubit gate errors in Sect. 6.2.
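The trajectory-based implementation described above can be sketched for a single qubit. This is a simplified illustration rather than the TFQ or Cirq implementation: instead of a full state vector it tracks only the Z expectation value, using the fact that X and Y errors flip its sign while Z leaves it unchanged, and all function names are hypothetical. For the channel sampled here (no error with probability \(1-p\), one of the three Paulis with probability \(p/3\) each), the density-matrix result is a rescaling of the Z expectation value by \(1 - 4p/3\).

```python
import random


def z_expectation_after_pauli(z_before, pauli):
    """<Z> of a single-qubit state after applying the given Pauli operator.

    X and Y flip the sign of <Z>; I and Z leave it unchanged.
    """
    return -z_before if pauli in ("X", "Y") else z_before


def trajectory_average(z_before, p, trajectories, seed=0):
    """Average <Z> over sampled trajectories of the Pauli error channel."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trajectories):
        if rng.random() < p:
            pauli = rng.choice(("X", "Y", "Z"))  # each with probability p / 3
        else:
            pauli = "I"                          # no error, probability 1 - p
        total += z_expectation_after_pauli(z_before, pauli)
    return total / trajectories


def exact_average(z_before, p):
    """Density-matrix result for the same channel: (1 - 4p/3) * <Z>."""
    return (1.0 - 4.0 * p / 3.0) * z_before
```

Comparing `trajectory_average(1.0, 0.1, n)` with `exact_average(1.0, 0.1)` for increasing `n` shows the statistical error of the trajectory method shrinking as more trajectories are averaged, which is the precision and runtime trade-off mentioned in the text.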
As mentioned above, simulating incoherent noise has high runtime requirements, so in the following we limit our studies to: (i) Qlearning in the CartPole environment, and (ii) the policy gradient method in the TSP environment. We pick these settings as they were the ones that were more sensitive to Gaussian coherent noise in our studies in Sect. 5, and in that sense represent the worst case instances from the previous section. To simulate the noisy quantum circuits, we use the Monte Carlo sampling as described above, where the number of trajectories used depends on the environment. As the CartPole environment requires a very high number of environment interactions (the better the agent, the more circuit evaluations are required per episode), we use 100 trajectories in this setting. In the TSP environment, the number of steps in each episode is constant and therefore we can use a higher number of 1000 trajectories and still perform simulations in a timely manner.
Figure 12 shows results of Q-learning agents trained in the CartPole environment with various error probabilities p. Agents with a realistic error probability of up to \(p=0.01\) still solve the environment in fewer than 2000 episodes on average. Agents trained with error probability \(p=0.005\) reach higher scores almost as quickly as agents trained in the noise-free setting, but stay somewhat unstable until they solve the environment after 3500 episodes on average. When the noise probability is increased to \(p=0.1\), we see that agents fail to make any learning progress at all.
Figure 13 shows the performance of the policy gradient method under single-qubit depolarization errors in the TSP environment. In this setting, agents trained with error probability \(p=0.01\), which is a realistic assumption for current devices, perform noticeably worse than agents in the noise-free setting, with a drop in approximation ratio of around 0.2 on average. Only when we consider an error probability of \(p=0.001\) do we get performance that is almost exactly the same as that in the noise-free case. Similar to the results of the Q-learning agent in the CartPole environment, agents trained with an error probability of \(p=0.1\) show no meaningful learning progress.
6.2 Noise model based on current hardware
After studying the effect of single-qubit depolarization errors in Sect. 6.1, we now study the performance of the Q-learning algorithm in the TSP environment in the presence of a more realistic noise model that captures the behaviour of a near-term superconducting quantum device. The error sources we incorporate into this noise model are the following: single-qubit and two-qubit depolarization errors, single-qubit amplitude damping errors, and measurement noise. While hardware providers like IBM and Google offer the possibility of simulating noise models of specific devices, we do not want to take device-specific factors like qubit topology and native gate sets into account in this work, as the performance in these settings also depends strongly on the quality of the circuit compiled to the native gate set and qubit connectivity [70]. Instead, we define a custom noise model based on gate fidelities published by hardware vendors, but do not take the above details into account. To determine realistic settings for the error probability of each noise source, we use calibration data published by IBM [71] at the time of writing. The noise model used in our simulation is specified as follows:

Depolarization error: Single-qubit depolarization channels with \(p=0.001\) are applied after every single-qubit gate. Two-qubit depolarization errors, defined by properly adjusting the definition in Equation (32), with \(p_{2}=0.01\) are applied after every two-qubit gate on the corresponding pair of qubits.

Amplitude damping error: Amplitude damping channels with decay parameter \(\gamma = 0.0003\) are applied after each single- and two-qubit gate on the corresponding qubits. Such a decay rate is valid for real devices having single-qubit gate durations of \(t=35~\mathrm{ns}\) and average qubit decay times \(T_{1} \approx 100~\mu \mathrm{s}\), which correspond to a decay parameter of \(\gamma = 1 - \exp (-t/T_{1}) \approx 0.0003\).

Measurement noise: Measurement errors are modeled by appending a bit-flip channel with probability \(p=0.01\) to every qubit right before the measurement process.
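The parameters of the noise model listed above can be collected in a short sketch. The container `NoiseModelConfig` and the helper names are hypothetical and only serve as illustration; the sketch checks the quoted decay parameter from \(t = 35~\mathrm{ns}\) and \(T_{1} \approx 100~\mu\mathrm{s}\), and shows how a bit-flip channel before measurement attenuates a Z expectation value.

```python
import math
from dataclasses import dataclass


def amplitude_damping_gamma(gate_time_s, t1_s):
    """Decay parameter gamma = 1 - exp(-t / T1) for one gate of duration t."""
    return 1.0 - math.exp(-gate_time_s / t1_s)


@dataclass(frozen=True)
class NoiseModelConfig:
    """Hypothetical container for the error probabilities listed in the text."""
    p_depol_1q: float = 0.001      # single-qubit depolarization
    p_depol_2q: float = 0.01       # two-qubit depolarization
    gamma_damping: float = 0.0003  # amplitude damping
    p_bitflip: float = 0.01        # bit flip right before measurement


def bitflip_attenuated_z(z_expectation, p_bitflip):
    """<Z> after a bit-flip channel: the sign flips with probability p."""
    return (1.0 - 2.0 * p_bitflip) * z_expectation


# t = 35 ns and T1 = 100 us, as quoted in the text
gamma = amplitude_damping_gamma(35e-9, 100e-6)
```

Under these assumptions the computed decay parameter is about \(3.5 \times 10^{-4}\), in line with the order of magnitude quoted for the amplitude damping channel.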
We recall that the circuit ansatz for the TSP environment is the one depicted in Fig. 2(c), where input information about the edge weights of the TSP instance is encoded by means of two-qubit gates. We therefore chose to study this ansatz in the context of a noise model that incorporates two-qubit errors, as we expect that these types of errors will affect the performance of an ansatz that encodes crucial information in two-qubit gates more severely. Additionally, it is hard to perform simulations in this setting for the CartPole environment in a reasonable amount of time, as discussed above. For these reasons, we restrict our attention to the TSP environment in this section.
Figure 14 shows results averaged over five Q-learning agents in the TSP environment for each of the error probability configurations of the custom noise model described above. We show the specific error probabilities used for the simulations in Table 1. Configuration a) corresponds to error probabilities that are consistent with those present on current quantum hardware as described above. Based on this, we specify three other configurations b) to d) by increasing the error probability of varying error sources. We note that while the error probabilities themselves in configuration a) are consistent with those on current hardware, our simulation is only an approximation of this noise model due to the Monte Carlo trajectory sampling method described in Sect. 6. To perform simulations in a reasonable time frame, we use 1000 trajectories for each circuit evaluation. The circuit that we simulate has 145 gates (counting a ZZ-gate as two CNOTs and one Z gate), and for small error probabilities the chance of applying each of the noise channels is relatively small. This means that in each trajectory, a relatively small number of noise channels is applied. Hence we expect that the results in Fig. 14 are slightly better than what we would get if the exact noise model was simulated (i.e., in the limit of a large number of trajectories, or by considering the full density matrix).
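The remark that only few noise channels fire per trajectory can be made concrete with a back-of-the-envelope sketch. It assumes, as a simplification that is not made in the text, that each of the 145 gates is followed by exactly one error channel with the same probability p; the actual noise model uses different probabilities per gate type and applies channels per qubit.

```python
def prob_no_error(n_gates, p):
    """Probability that a single trajectory contains no error event at all."""
    return (1.0 - p) ** n_gates


def expected_error_events(n_gates, p):
    """Mean number of error events per trajectory (binomial mean)."""
    return n_gates * p


N_GATES = 145  # gate count quoted in the text

for p in (0.001, 0.01, 0.1):
    print(p, round(prob_no_error(N_GATES, p), 4), expected_error_events(N_GATES, p))
```

Under this simplification, the large majority of trajectories at \(p=0.001\) are entirely noise-free, which is consistent with the expectation that a finite number of trajectories yields slightly optimistic results.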
Looking at the results in Fig. 14, we see that for configuration a) (blue), the performance of the agents matches that of the noise-free ones (dotted black) almost exactly, and the noise model based on realistic error strengths of current devices does not affect training. We see a slight drop in performance when we increase the error probability of the amplitude damping channels from 0.0003 to 0.03 (orange), as described in Table 1, column b). For configuration c), we also increase the error probabilities of the remaining error sources, which leads to a considerable drop in performance. In configuration d), we assume extremely high error probabilities for each of the noise channels, which leads to a complete failure of the agents to make any meaningful learning progress in this environment.
7 Conclusions
Our goal in this work was to evaluate the resilience of variational RL algorithms to various types of noise that are present on real quantum hardware. First, we investigated shot noise, which results from the probabilistic nature of quantum measurements. We introduced a method to reduce the number of shots needed to train a Q-learning agent, motivated by the specific structure of the underlying RL algorithm. Our shot allocation technique enables a more shot-frugal training of variational Q-learning models with little or no effect on the final performance of the agents.
After considering shot noise, we moved on to study the effect of Gaussian coherent errors that can arise on real hardware due to miscalibration of the device, or imprecise pulse sequences that implement the parameterised gates in the quantum circuit. We gave an analytic expression for how this type of noise affects the output of a quantum RL agent, and provided a bound on the standard deviation of the Gaussian error that elucidates the tolerable magnitude of the error on the output of a quantum model. We confirmed this bound in our simulations, where we studied the effect of various levels of Gaussian perturbations on the performance of training policy gradient and Q-learning agents in two different environments. For one of these environments, we find that agents trained with higher noise levels also learn more robust policies and Q-functions, in the sense that when evaluated under different perturbation levels, these agents achieve optimal or close to optimal performance more often.
Finally, we studied incoherent noise that emerges in real hardware due to undesired interactions of the qubits with the surrounding environment, as the device is not completely shielded from external effects. To this end, we considered single-qubit depolarization errors, as well as a custom noise model that combines single- and two-qubit depolarization errors, amplitude damping errors, and bit-flip (measurement) errors. For the latter, we performed simulations with realistic error probabilities for each of the noise channels, in line with data published for IBM devices at the time of writing.
Overall, we find that the effect of noise on training variational RL algorithms for Q-learning and the policy gradient method depends strongly on the strength of the noise, as well as the type of noise itself. For some cases, like decoherence errors with realistic error probabilities of current devices, the drop in performance is relatively small. On the other hand, we find that large Gaussian perturbations as well as errors induced by the probabilistic nature of quantum measurements can affect performance in highly detrimental ways. Additionally, we find that for Gaussian coherent noise, agents that are trained with higher perturbations learn more noise-robust policies in some cases, similar to results in the classical literature, where noise is used as a regularization technique.
While our results were obtained in a regime that is still efficiently simulable on classical computers, it is an interesting question for future work to consider the implications of noise-robustness of large-scale quantum models in light of recent results which show that in certain settings, the outputs of noisy quantum circuits can be efficiently approximated classically [72, 73]. This raises the question to what extent an inherent noise-robustness of hybrid variational quantum machine learning affects the possibility of achieving a quantum advantage with these types of models.
On the practical side, the optimization procedures that we used in this work were the same as those commonly used to train models in noise-free simulations and are not tailored to account for quantum-hardware-specific noise. This raises the question of how optimization methods that are tailored to the special characteristics of variational quantum models could further improve the performance of these types of models in a noisy setting. For the optimization of PQC parameters in the combinatorial optimization or quantum chemistry setting, it is known that some optimization methods, like simultaneous perturbation stochastic approximation (SPSA), actually become better with noise. It is an interesting area of future research to design quantum-specific optimization routines for machine learning that address or even combat specific types of noise, for example by leveraging effective quantum error mitigation techniques [74–76]. Our work motivates the study of these types of optimization methods, as well as continued efforts to find learning tasks where variational RL algorithms can potentially provide an advantage.
Availability of data and materials
The code that was used to generate the numerical results in this work can be found on GitHub (https://github.com/askolik/noisy_qrl), along with the data set containing the TSP instances studied in this work and their optimal solutions.
Notes
In the ϵ-greedy policy (see Sect. 2.1) we consider here, the agent picks either the action corresponding to the argmax Q-value, or a random action. As no circuit evaluation is required to pick a random action, we only consider the steps with actual action selection by the agent in this section.
References
Bharti K, CerveraLierta A, Kyaw TH, Haug T, AlperinLea S, Anand A, Degroote M, Heimonen H, Kottmann JS, Menke T, Mok WK, Sim S, Kwek LC, AspuruGuzik A. Noisy intermediatescale quantum algorithms. Rev Mod Phys. 2022;94:015004.
Cerezo M, Arrasmith A, Babbush R, Benjamin SC, Endo S, Fujii K, McClean JR, Mitarai K, Yuan X, Cincio L et al.. Variational quantum algorithms. Nat Rev Phys. 2021;3(9):625–44.
Mangini S, Tacchino F, Gerace D, Bajoni D, Macchiavello C. Quantum computing models for artificial neural networks. Europhys Lett. 2021;134(1):10002.
Gentini L, Cuccoli A, Pirandola S, Verrucchi P, Banchi L. Noiseresilient variational hybrid quantumclassical optimization. Phys Rev A. 2020;102:052414.
Jim KC, Giles CL, Horne BG. An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Trans Neural Netw. 1996;7(6):1424–38.
Noh H, You T, Mun J, Han B. Regularizing deep neural networks by noise: its interpretation and optimization. In: Advances in neural information processing systems. vol. 30. 2017.
Graves A. Practical variational inference for neural networks. In: Advances in neural information processing systems. vol. 24. 2011.
Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing. Ieee; 2013. p. 6645–9.
Balda ER, Behboodi A, Mathar R. Adversarial examples in deep neural networks: an overview. In: Deep learning: algorithms and applications; 2020. p. 31–65.
Xie C, Wang J, Zhang Z, Ren Z, Yuille A. Mitigating adversarial effects through randomization. 2017. arXiv preprint. arXiv:1711.01991.
Gilmer J, Ford N, Carlini N, Cubuk E. Adversarial examples are a natural consequence of test error in noise. In: International conference on machine learning. PMLR; 2019. p. 2280–9.
Jaeckle F, Kumar MP. Generating adversarial examples with graph neural networks. In: Uncertainty in artificial intelligence. PMLR; 2021. p. 1556–64.
Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. 2014. arXiv preprint. arXiv:1412.6572.
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.
Wang S, Fontana E, Cerezo M, Sharma K, Sone A, Cincio L, Coles PJ. Noiseinduced barren plateaus in variational quantum algorithms. Nat Commun. 2021;12(1):6961.
Zeng J, Wu Z, Cao C, Zhang C, Hou SY, Xu P, Zeng B. Simulating noisy variational quantum eigensolver with local noise models. Quantum Eng. 2021;3(4):e77.
Farhi E, Goldstone J, Gutmann S. A quantum approximate optimization algorithm. 2014. arXiv preprint. arXiv:1411.4028.
Alam M, AshSaki A, Ghosh S. Analysis of quantum approximate optimization algorithm under realistic noise in superconducting qubits. 2019. arXiv preprint. arXiv:1907.09631.
Harrigan MP, Sung KJ, Neeley M, Satzinger KJ, Arute F, Arya K, Atalaya J, Bardin JC, Barends R, Boixo S et al.. Quantum approximate optimization of nonplanar graph problems on a planar superconducting processor. Nat Phys. 2021;17(3):332–6.
LaRose R, Coyle B. Robust data encodings for quantum classifiers. Phys Rev A. 2020;102:032420.
Liu J, Wilde F, Mele AA, Jiang L, Eisert J. Noise can be helpful for variational quantum algorithms. 2022. arXiv preprint. arXiv:2210.06723.
Wang J, Liu Y, Li B. Reinforcement learning with perturbed rewards. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34. 2020. p. 6202–9.
Huang S, Papernot N, Goodfellow I, Duan Y, Abbeel P. Adversarial attacks on neural network policies. 2017. arXiv preprint. arXiv:1702.02284.
Kos J, Song D. Delving into adversarial attacks on deep policies. 2017. arXiv preprint. arXiv:1705.06452.
Yu Y. Towards sample efficient reinforcement learning. In: IJCAI. 2018. p. 5739–43.
Chen SYC, Yang CHH, Qi J, Chen PY, Ma X, Goan HS. Variational quantum circuits for deep reinforcement learning. IEEE Access. 2020;8:141007–24.
Lockwood O, Si M. Reinforcement learning with quantum variational circuit. In: Proceedings of the AAAI conference on artificial intelligence and interactive digital entertainment. vol. 16. 2020. p. 245–51.
Jerbi S, Gyurik C, Marshall S, Briegel H, Dunjko V. Parametrized quantum policies for reinforcement learning. Adv Neural Inf Process Syst. 2021;34:28362–75.
Skolik A, Jerbi S, Dunjko V. Quantum agents in the gym: a variational quantum algorithm for deep qlearning. Quantum. 2022;6:720.
Lan Q. Variational quantum soft actorcritic. 2021. arXiv preprint. arXiv:2112.11921.
Wu S, Jin S, Wen D, Wang X. Quantum reinforcement learning in continuous action space. 2020. arXiv preprint. arXiv:2012.10711.
Sequeira A, Santos LP, Barbosa LS. Variational quantum policy gradients with an application to quantum control. 2022. arXiv preprint. arXiv:2203.10591.
Lockwood O, Si M. Playing atari with hybrid quantumclassical reinforcement learning. In: NeurIPS 2020 workshop on preregistration in machine learning. PMLR; 2021. p. 285–301.
Franz M, Wolf L, Periyasamy M, Ufrecht C, Scherer DD, Plinge A, Mutschler C, Mauerer W. Uncovering instabilities in variationalquantum deep qnetworks. 2022. arXiv preprint. arXiv:2202.05195.
Ito K, Mizukami W, Fujii K. Universal noiseprecision relations in variational quantum algorithms. 2021. arXiv preprint. arXiv:2106.03390.
Cerezo M, Coles PJ. Higher order derivatives of quantum neural networks with barren plateaus. Quantum Sci Technol. 2021;6(3):035006.
Sutton RS, Barto AG. Reinforcement learning: an introduction. Cambridge: MIT Press; 2018.
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al.. Humanlevel control through deep reinforcement learning. Nature. 2015;518(7540):529–33.
Skolik A, Cattelan M, Yarkoni S, Bäck T, Dunjko V. Equivariant quantum circuits for learning on weighted graphs. 2022. arXiv preprint. arXiv:2205.06109.
Skolik A, Mangini S. Code that was used for training of noisy quantum agents. 2022. https://github.com/askolik/noisy_qrl.
Openai gym. https://github.com/openai/gym/wiki. Accessed: 06092022.
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D. Continuous control with deep reinforcement learning. 2015. arXiv preprint. arXiv:1509.02971.
Kandala A, Mezzacapo A, Temme K, Takita M, Brink M, Chow JM, Gambetta JM. Hardwareefficient variational quantum eigensolver for small molecules and quantum magnets. Nature. 2017;549(7671):242–6.
PérezSalinas A, CerveraLierta A, GilFuster E, Latorre JI. Data reuploading for a universal quantum classifier. Quantum. 2020;4:226.
Schuld M, Sweke R, Meyer JJ. Effect of data encoding on the expressive power of variational quantummachinelearning models. Phys Rev A. 2021;103:032430.
Tensorflow quantum rl tutorial. https://www.tensorflow.org/quantum/tutorials/quantum_reinforcement_learning. Accessed: 06092022.
Bello I, Pham H, Le QV, Norouzi M, Bengio S. Neural combinatorial optimization with reinforcement learning. 2016. arXiv preprint. arXiv:1611.09940.
Slivkins A et al.. Introduction to multiarmed bandits. Found Trends Mach Learn. 2019;12(1–2):1–286.
Lai TL, Robbins H et al.. Asymptotically efficient adaptive allocation rules. Adv Appl Math. 1985;6(1):4–22.
Auer P. Using confidence bounds for exploitationexploration tradeoffs. J Mach Learn Res. 2002;3(Nov):397–422.
Cai Z, Xu X, Benjamin SC. Mitigating coherent noise using pauli conjugation. npj Quantum Inf. 2020;6(1):1–9.
Schuld M, Bergholm V, Gogolin C, Izaac J, Killoran N. Evaluating analytic gradients on quantum hardware. Phys Rev A. 2019;99:032331.
Mitarai K, Negoro M, Kitagawa M, Fujii K. Quantum circuit learning. Phys Rev A. 2018;98:032309.
McClean JR, Boixo S, Smelyanskiy VN, Babbush R, Neven H. Barren plateaus in quantum neural network training landscapes. Nat Commun. 2018;9(1):4812.
Cerezo M, Sone A, Volkoff T, Cincio L, Coles PJ. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nat Commun. 2021;12(1):1791.
Holmes Z, Sharma K, Cerezo M, Coles PJ. Connecting ansatz expressibility to gradient magnitudes and barren plateaus. PRX Quantum. 2022;3:010313.
Huang HY, Kueng R, Preskill J. Predicting many properties of a quantum system from very few measurements. Nat Phys. 2020;16(10):1050–7.
Puchała Z, Miszczak JA. Symbolic integration with respect to the Haar measure on the unitary groups. Bull Pol Acad Sci, Tech Sci. 2017;65(1):21–7.
Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double qlearning. In: Proceedings of the AAAI conference on artificial intelligence. vol. 30. 2016.
Broughton M, Verdon G, McCourt T, Martinez AJ, Yoo JH, Isakov SV, Massey P, Halavati R, Niu MY, Zlokapa A, et al. Tensorflow quantum: a software framework for quantum machine learning. 2020. arXiv preprint. arXiv:2003.02989.
Google Inc. Documentation of depolarizing channel in cirq. 2022. https://quantumai.google/reference/python/cirq/depolarize.
Isakov SV, Kafri D, Martin O, Heidweiller CV, Mruczkiewicz W, Harrigan MP, Rubin NC, Thomson R, Broughton M, Kissell K, Peters E, Gustafson E, Li ACY, Lamm H, Perdue G, Ho AK, Strain D, Boixo S. Simulations of quantum circuits with approximate noise using qsim and cirq. 2021.
Nielsen MA, Chuang IL. Quantum computation and quantum information. Cambridge: Cambridge University Press; 2010.
Proctor T, Seritan S, Rudinger K, Nielsen E, BlumeKohout R, Young K. Scalable randomized benchmarking of quantum computers using mirror circuits. Phys Rev Lett. 2022;129:150502.
Vovrosh J, Khosla KE, Greenaway S, Self C, Kim MS, Knolle J. Simple mitigation of global depolarizing errors in quantum simulations. Phys Rev E. 2021;104:035309.
Magesan E, Gambetta JM, Emerson J. Characterizing quantum gates via randomized benchmarking. Phys Rev A. 2012;85:042311.
McKay DC, Sheldon S, Smolin JA, Chow JM, Gambetta JM. Threequbit randomized benchmarking. Phys Rev Lett. 2019;122:200502.
RyanAnderson C, Brown NC, Allman MS, Arkin B, AsaAttuah G, Baldwin C, Berg J, Bohnet JG, Braxton S, Burdick N, Campora JP, Chernoguzov A, Esposito J, Evans B, Francois D, Gaebler JP, Gatterman TM, Gerber J, Gilmore K, Gresh D, Hall A, Hankin A, Hostetter J, Lucchetti D, Mayer K, Myers J, Neyenhuis B, Santiago J, Sedlacek J, Skripka T, Slattery A, Stutz RP, Tait J, Tobey R, Vittorini G, Walker J, Hayes D. 2022.
Ibmquantum. 2022. https://quantumcomputing.ibm.com/.
Pelofske E, Bärtschi A, Eidenbenz S. Quantum volume in practice: what users can expect from nisq devices. 2022. arXiv preprint. arXiv:2203.03816.
IBM Quantum Experience. https://quantumcomputing.ibm.com/services/resources?tab=systems; 2022.
França DS, GarciaPatron R. Limitations of optimization algorithms on noisy quantum devices. Nat Phys. 2021;17(11):1221–7.
Gao X, Duan L. Efficient classical simulation of noisy quantum computation. 2018. arXiv preprint. arXiv:1810.03176.
LaRose R, Mari A, Kaiser S, Karalekas PJ, Alves AA, Czarnik P, El Mandouh M, Gordon MH, Hindy Y, Robertson A, Thakre P, Wahl M, Samuel D, Mistri R, Tremblay M, Gardner N, Stemen NT, Shammah N, Zeng WJ. Mitiq: a software package for error mitigation on noisy quantum computers. Quantum. 2022;6:774.
Russo V, Mari A, Shammah N, LaRose R, Zeng WJ. Testing platformindependent quantum error mitigation on noisy quantum computers. 2022.
Wang S, Czarnik P, Arrasmith A, Cerezo M, Cincio L, Coles PJ. Can error mitigation improve trainability of noisy variational quantum algorithms? 2021.
Huembeli P, Dauphin A. Characterizing the loss landscape of variational quantum circuits. Quantum Sci Technol. 2021;6(2):025011.
Fukuda M, König R, Nechita I. RTNI—a symbolic integrator for Haarrandom tensor networks. J Phys A, Math Theor. 2019;52(42):425303.
Keener RW. Theoretical statistics: topics for a core course. 1st ed. Springer texts in statistics. Berlin: Springer; 2010.
Bergholm V, Izaac J, Schuld M, Gogolin C, Ahmed S, Ajith V, Alam MS, AlonsoLinaje G, AkashNarayanan B, Asadi A, Arrazola JM, Azad U, Banning S, Blank C, Bromley TR, Cordier BA, Ceroni J, Delgado A, Di Matteo O, Dusko A, Garg T, Guala D, Hayes A, Hill R, Ijaz A, Isacsson T, Ittah D, Jahangiri S, Jain P, Jiang E, Khandelwal A, Kottmann K, Lang RA, Lee C, Loke T, Lowe A, McKiernan K, Meyer JJ, MontañezBarrera JA, Moyard R, Niu Z, O’Riordan LJ, Oud S, Panigrahi A, Park CY, Polatajko D, Quesada N, Roberts C, Sá N, Schoch I, Shi B, Shu S, Sim S, Singh A, Strandberg I, Soni J, Száva A, Thabet S, VargasHernández RA, Vincent T, Vitucci N, Weber M, Wierichs D, Wiersema R, Willmann M, Wong V, Zhang S, Killoran N. Pennylane: automatic differentiation of hybrid quantumclassical computations. 2018.
Acknowledgements
AS is funded by the German Ministry for Education and Research (BMB+F) in the project QAI2QKIS under grant 13N15587. This work was also supported by the Dutch Research Council (NWO/OCW), as part of the Quantum Software Consortium programme (project number 024.003.037). CM acknowledges support by the National Research Centre for HPC, Big Data and Quantum Computing (ICSC: MUR project CN00000013).
Author information
Contributions
AS conceived the idea for this work and conducted the numerical experiments. SM performed the analytical study on the effect of Gaussian noise and provided the decoherence noise model. AS and VD proposed the shot allocation algorithm. AS and SM wrote the first version of the manuscript; all authors contributed to the final editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Appendices
Appendix A: Additional results for flexible vs. fixed number of shots in Qlearning
Appendix B: Gaussian noise analysis
In this Appendix we perform the noise analysis of a scalar function whose parameters are corrupted by independently distributed Gaussian perturbations. Let \(f: \mathbb{R}^{M} \rightarrow \mathbb{R}\) be the function under investigation, whose parameters \(\boldsymbol{\theta} = (\theta _{1}, \ldots , \theta _{M})\in \mathbb{R}^{M}\) are corrupted by Gaussian noise \(\theta _{i} \rightarrow \theta _{i} + \delta \theta _{i}\) with zero mean and variance \(\sigma ^{2}\), i.e.
\[\mathbb{E}[\delta \theta _{i}] = 0, \qquad \mathbb{E}[\delta \theta _{i}\delta \theta _{j}] = \sigma ^{2} \delta _{ij}. \tag{B.1}\]
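Since the perturbations are independently distributed and Gaussian, all higher-order moments can be evaluated starting from two-point correlators of the form \(\mathbb{E}[\delta \theta _{i}\delta \theta _{j}]\), as dictated by Wick's formula for multivariate normal distributions
\[\mathbb{E}[\delta \theta _{i_{1}} \delta \theta _{i_{2}} \cdots \delta \theta _{i_{2n}}] = \sum_{\mathcal{P}} \prod_{(a,b) \in \mathcal{P}} \mathbb{E}[\delta \theta _{i_{a}} \delta \theta _{i_{b}}], \tag{B.2}\]
where the sum runs over all \((2n-1)!!\) distinct pairings \(\mathcal{P}\) of the 2n variables, so that all higher-order even moments are expressed in terms of products of second moments. Note that all the terms involving an odd number of perturbations \(\delta \theta _{i}\) vanish, and only even moments remain. For example, expression (B.2) for the fourth-order moment (\(n=2\)) amounts to
\[\mathbb{E}[\delta \theta _{i} \delta \theta _{j} \delta \theta _{k} \delta \theta _{l}] = \sigma ^{4} (\delta _{ij}\delta _{kl} + \delta _{ik}\delta _{jl} + \delta _{il}\delta _{jk}). \tag{B.3}\]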
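We now proceed by considering the multi-dimensional Taylor expansion of the function \(f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})\) around the noise-free point. Up to arbitrary order, this reads
\[f(\boldsymbol{\theta}+\delta \boldsymbol{\theta}) = f(\boldsymbol{\theta}) + \sum_{i} \frac{\partial f(\boldsymbol{\theta})}{\partial \theta _{i}} \delta \theta _{i} + \frac{1}{2!} \sum_{i,j} \frac{\partial ^{2} f(\boldsymbol{\theta})}{\partial \theta _{i} \partial \theta _{j}} \delta \theta _{i} \delta \theta _{j} + \frac{1}{3!} \sum_{i,j,k} \frac{\partial ^{3} f(\boldsymbol{\theta})}{\partial \theta _{i} \partial \theta _{j} \partial \theta _{k}} \delta \theta _{i} \delta \theta _{j} \delta \theta _{k} + \cdots , \tag{B.4}\]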
where we used the equal sign because we are considering the full Taylor series, and we assume that this converges to the true function (this statement can be made precise by showing that the remainder term of the expansion goes to zero as the order of expansion goes to infinity).
Before proceeding, we simplify the notation to make the calculation of the Taylor expansion easier to follow. First, we denote the partial derivatives with respect to parameter \(\theta _{i}\) as \(\partial _{i} := \partial /\partial \theta _{i}\), and similarly for higher-order derivatives, for example \(\partial _{ij} = \partial ^{2} /\partial{\theta _{i}}\partial{ \theta _{j}}\). Also, we suppress the explicit dependence of the function on θ, using the shorthand f instead of \(f(\boldsymbol{\theta})\). Lastly, we make use of Einstein's summation notation, where repeated indices imply summation.
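With this setup, using Eqs. (B.1), (B.2) and (B.3) in (B.4), one can evaluate the expectation value of the function over the perturbations' distribution as
\[\begin{aligned} \mathbb{E}[f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})] &= f + \frac{1}{2} \partial _{ij} f \, \mathbb{E}[\delta \theta _{i} \delta \theta _{j}] + \frac{1}{4!} \partial _{ijkl} f \, \mathbb{E}[\delta \theta _{i} \delta \theta _{j} \delta \theta _{k} \delta \theta _{l}] + \cdots \\ &= f + \frac{\sigma ^{2}}{2} \partial _{ii} f + \frac{\sigma ^{4}}{8} \partial _{iijj} f + \cdots , \end{aligned} \tag{B.5}\]
where in the last line we simplified the fourth-order term as
\[\partial _{ijkl} f \, (\delta _{ij}\delta _{kl} + \delta _{ik}\delta _{jl} + \delta _{il}\delta _{jk}) = 3\, \partial _{iijj} f. \tag{B.6}\]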
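Since the expectation values involving an odd number of perturbations vanish, only the even-order terms survive, and these can be expressed as
\[\frac{1}{(2n)!} \, \partial _{i_{1} i_{2} \cdots i_{2n}} f \; \mathbb{E}[\delta \theta _{i_{1}} \delta \theta _{i_{2}} \cdots \delta \theta _{i_{2n}}] = \frac{\sigma ^{2n} (2n-1)!!}{(2n)!} \, \partial _{i_{1} i_{1} \cdots i_{n} i_{n}} f, \tag{B.7}\]
where the coefficient \((2n-1)!!\) is the number of distinct pairings of 2n objects, which comes from Wick's formula, Eq. (B.2).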
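Thus, the full Taylor series can be formally written as
\[\mathbb{E}[f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})] = f(\boldsymbol{\theta}) + \frac{\sigma ^{2}}{2} \operatorname{Tr} [H(\boldsymbol{\theta})] + \sum_{n=2}^{\infty} \frac{\sigma ^{2n} (2n-1)!!}{(2n)!} \, \partial _{i_{1} i_{1} \cdots i_{n} i_{n}} f(\boldsymbol{\theta}), \tag{B.8}\]
where we introduced the Hessian matrix \(H(\boldsymbol{\theta})\), whose elements are given by \([H(\boldsymbol{\theta})]_{ij} = \partial _{ij}f(\boldsymbol{\theta})\), and we see that this term represents the first non-vanishing correction to the function caused by the perturbation.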
Our goal is to bound the absolute error
\[\varepsilon _{\boldsymbol{\theta}} := \bigl| \mathbb{E}[f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})] - f(\boldsymbol{\theta}) \bigr| \tag{B.9}\]
caused by the Gaussian noise, and we can do that by using the property that all the derivatives of most PQCs (parametrized quantum circuits) are bounded. In fact, for those circuits for which a parameter-shift rule holds [52, 53], one can show that any derivative of the function \(f(\boldsymbol{\theta}) = \langle O\rangle = \operatorname{Tr} [O U(\boldsymbol{\theta})|0\rangle \langle 0| U^{\dagger}(\boldsymbol{\theta})]\) obeys
\[\bigl| \partial _{i_{1} i_{2} \cdots i_{k}} f(\boldsymbol{\theta}) \bigr| \leq \|O\|_{\infty}, \tag{B.10}\]
where \(\|O\|_{\infty}\) is the infinity norm of the observable, namely its largest absolute eigenvalue. We give a proof of this below in Sect. B.1.
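Plugging this in Eq. (B.9), we can obtain an upper bound to the error \(\varepsilon _{\boldsymbol{\theta}}\) as desired. Indeed, remembering that the double factorial of an odd number can be expressed as \((2n-1)!! = (2n)!/(2^{n} n!)\), it holds
\[\begin{aligned} \varepsilon _{\boldsymbol{\theta}} &= \Biggl| \sum_{n=1}^{\infty} \frac{\sigma ^{2n}}{2^{n} n!} \sum_{i_{1}, \ldots , i_{n}=1}^{M} \partial _{i_{1} i_{1} \cdots i_{n} i_{n}} f(\boldsymbol{\theta}) \Biggr| \leq \sum_{n=1}^{\infty} \frac{\sigma ^{2n}}{2^{n} n!} \, M^{n} \|O\|_{\infty} \\ &= \|O\|_{\infty} \sum_{n=1}^{\infty} \frac{1}{n!} \biggl( \frac{M \sigma ^{2}}{2} \biggr)^{n} = \|O\|_{\infty} \Bigl( e^{M \sigma ^{2}/2} - 1 \Bigr), \end{aligned} \tag{B.11}\]
where in the last line we used the definition of the exponential function \(e^{x} = \sum_{n=0}^{\infty} \frac{x^{n}}{n!}\).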
One can see that the noise variance \(\sigma ^{2}\) must scale as the inverse of the number of parameters, \(\sigma ^{2} \in \mathcal{O}(M^{-1})\), in order to have small deviations induced by the noise. Also, note that since the difference between the noise-free function \(f(\boldsymbol{\theta})\) and its perturbed version \(f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})\) cannot be larger than twice the largest absolute eigenvalue of O, \(|f(\boldsymbol{\theta}+\delta \boldsymbol{\theta}) - f(\boldsymbol{\theta})| \leq |f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})| + |f(\boldsymbol{\theta})| \leq 2 \|O\|_{\infty}\), the bound (B.11) is informative only as long as \(\exp [M\sigma ^{2}/2] - 1 < 2\).
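The scaling of the error with \(M\sigma ^{2}\) can be checked numerically on a toy model. The sketch below assumes the function \(f(\boldsymbol{\theta}) = \prod_{i} \cos \theta _{i}\), which is the expectation value of \(Z^{\otimes M}\) after one X rotation per qubit, so that \(\|O\|_{\infty} = 1\); this choice, the parameter values and all function names are illustrative assumptions and not part of the original analysis. For this function the Gaussian average is known in closed form, \(\mathbb{E}[f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})] = f(\boldsymbol{\theta})\, e^{-M\sigma ^{2}/2}\), and the Hessian correction is \(\frac{\sigma ^{2}}{2}\operatorname{Tr}[H] = -\frac{M\sigma ^{2}}{2} f(\boldsymbol{\theta})\).

```python
import math
import random


def f(thetas):
    """Toy model output: product of cosines, so |f| <= 1 = ||O||_inf."""
    return math.prod(math.cos(t) for t in thetas)


def noisy_mean(thetas, sigma, n_samples, seed=0):
    """Monte Carlo estimate of E[f(theta + delta)], delta_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += f([t + rng.gauss(0.0, sigma) for t in thetas])
    return total / n_samples


def error_bound(n_params, sigma, op_norm=1.0):
    """Upper bound ||O||_inf * (exp(M sigma^2 / 2) - 1) on the absolute error."""
    return op_norm * (math.exp(n_params * sigma**2 / 2.0) - 1.0)


thetas = [0.1, 0.2, 0.3, 0.4]  # arbitrary illustrative parameters
sigma = 0.3

# Closed-form Gaussian average for the product of cosines
exact_mean = f(thetas) * math.exp(-len(thetas) * sigma**2 / 2.0)
# Noise-free value plus the first non-vanishing (Hessian) correction
hessian_approx = f(thetas) * (1.0 - len(thetas) * sigma**2 / 2.0)
```

Comparing `noisy_mean(thetas, sigma, n_samples)` with `exact_mean`, `hessian_approx` and `error_bound(len(thetas), sigma)` shows how loose the bound is for a given circuit; the bound only relies on the boundedness of the derivatives and not on the specific structure of the function.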
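It is worth noticing that an identical procedure can be used to bound the average error obtained by approximating the perturbed function with its first non-vanishing correction given by the Hessian. Indeed, starting from Eq. (B.8) and repeating the same calculation as above, one obtains
\[\biggl| \mathbb{E}[f(\boldsymbol{\theta}+\delta \boldsymbol{\theta})] - f(\boldsymbol{\theta}) - \frac{\sigma ^{2}}{2} \operatorname{Tr} [H(\boldsymbol{\theta})] \biggr| \leq \|O\|_{\infty} \biggl( e^{M \sigma ^{2}/2} - 1 - \frac{M \sigma ^{2}}{2} \biggr).\]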
B.1 Parameter-shift rule and bounds to the derivatives
Let \(f(\boldsymbol{\theta}) = \operatorname{Tr} [O U(\boldsymbol{\theta})|0\rangle \langle 0| U^{\dagger}( \boldsymbol{\theta})]\) be the expectation value of an observable O on the parametrized state \(|\psi (\boldsymbol{\theta})\rangle = U(\boldsymbol{\theta})|0\rangle\) obtained with a parametrized quantum circuit \(U(\boldsymbol{\theta})\). When the variational parameters \(\boldsymbol{\theta} \in \mathbb{R}^{M}\) enter the quantum circuit via rotation gates of the form \(V(\theta _{i}) = \exp [-i \theta _{i} P / 2]\), with P a Pauli operator satisfying \(P^{2}=\mathbb{1}\), the parameter-shift rule can be used to evaluate gradients of the expectation value as [52, 53]
\[\partial _{i} f(\boldsymbol{\theta}) = \frac{1}{2} \Bigl[ f \Bigl(\boldsymbol{\theta} + \frac{\pi}{2} \boldsymbol{e}_{i}\Bigr) - f \Bigl(\boldsymbol{\theta} - \frac{\pi}{2} \boldsymbol{e}_{i}\Bigr) \Bigr],\]
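where \(\boldsymbol{e}_{i}\) is the unit vector with zero entries and a one in the i-th position corresponding to angle \(\theta _{i}\). Similarly, by applying the parameter-shift rule twice one can express second-order derivatives as follows using four evaluations of the circuit [35, 77]
\[\partial _{ij} f(\boldsymbol{\theta}) = \frac{1}{4} \Bigl[ f \Bigl(\boldsymbol{\theta} + \frac{\pi}{2} (\boldsymbol{e}_{i} + \boldsymbol{e}_{j})\Bigr) - f \Bigl(\boldsymbol{\theta} + \frac{\pi}{2} (\boldsymbol{e}_{i} - \boldsymbol{e}_{j})\Bigr) - f \Bigl(\boldsymbol{\theta} - \frac{\pi}{2} (\boldsymbol{e}_{i} - \boldsymbol{e}_{j})\Bigr) + f \Bigl(\boldsymbol{\theta} - \frac{\pi}{2} (\boldsymbol{e}_{i} + \boldsymbol{e}_{j})\Bigr) \Bigr].\]
In particular, for the diagonal elements \(i=j\), one has
\[\partial _{ii} f(\boldsymbol{\theta}) = \frac{1}{4} \bigl[ f (\boldsymbol{\theta} + \pi \boldsymbol{e}_{i}) - 2 f(\boldsymbol{\theta}) + f (\boldsymbol{\theta} - \pi \boldsymbol{e}_{i}) \bigr] = \frac{1}{2} \bigl[ f (\boldsymbol{\theta} + \pi \boldsymbol{e}_{i}) - f(\boldsymbol{\theta}) \bigr], \tag{B.16}\]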
where we used the fact that \(f (\boldsymbol{\theta} + \pi \boldsymbol{e}_{i}) = f (\boldsymbol{\theta} - \pi \boldsymbol{e}_{i})\). This last equality can be seen intuitively from the 2π periodicity of the rotation gates or by direct evaluation. In fact, let \(U(\boldsymbol{\theta}) = U_{2} \exp [-i \theta _{i} P_{i}/2] U_{1}\) be a factorization of the parametrized unitary where we isolated the dependence on the parameter \(\theta _{i}\) to be shifted. Then, since \(\exp [-i 2\pi P /2] = \cos{\pi}\, \mathbb{I} - i\sin{\pi}\, P = - \mathbb{I}\), one has
\[|\psi (\boldsymbol{\theta} + \pi \boldsymbol{e}_{i})\rangle = U_{2}\, e^{-i (\theta _{i} + \pi ) P_{i}/2}\, U_{1} |0\rangle = U_{2}\, e^{-i 2\pi P_{i}/2}\, e^{-i (\theta _{i} - \pi ) P_{i}/2}\, U_{1} |0\rangle = - |\psi (\boldsymbol{\theta} - \pi \boldsymbol{e}_{i})\rangle ,\]
and thus \(\langle \psi (\boldsymbol{\theta} - \pi \boldsymbol{e}_{i})|O|\psi (\boldsymbol{\theta} - \pi \boldsymbol{e}_{i})\rangle = \langle \psi (\boldsymbol{\theta}+\pi \boldsymbol{e}_{i})|O|\psi (\boldsymbol{\theta}+\pi \boldsymbol{e}_{i})\rangle\).
Hence, using Eq. (B.16) it is possible to estimate the diagonal elements of the Hessian matrix with just two different evaluations of the quantum circuit.
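The shift rules above can be checked numerically on a small example. The following is a minimal sketch, assuming a hypothetical two-qubit circuit (two layers of single-qubit Pauli rotations separated by a CZ gate) and the observable \(Z \otimes Z\); neither is taken from the main text, and the helper names `rot`, `shift`, `ps_gradient` and `ps_hessian_diag` are illustrative only.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
CZ = np.diag([1, 1, 1, -1]).astype(complex)


def rot(pauli, theta):
    """Pauli rotation exp(-i theta P / 2) = cos(theta/2) I - i sin(theta/2) P."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * pauli


def f(theta):
    """Expectation value of Z (x) Z on a toy two-qubit circuit (assumed ansatz)."""
    state = np.zeros(4, dtype=complex)
    state[0] = 1.0  # |00>
    state = np.kron(rot(Y, theta[0]), rot(Y, theta[1])) @ state
    state = CZ @ state
    state = np.kron(rot(X, theta[2]), rot(Y, theta[3])) @ state
    return float(np.real(state.conj() @ np.kron(Z, Z) @ state))


def shift(theta, i, s):
    """Return a copy of theta with the i-th angle shifted by s."""
    out = np.array(theta, dtype=float)
    out[i] += s
    return out


def ps_gradient(theta, i):
    """First derivative from two circuit evaluations shifted by +-pi/2."""
    return 0.5 * (f(shift(theta, i, np.pi / 2)) - f(shift(theta, i, -np.pi / 2)))


def ps_hessian_diag(theta, i):
    """Diagonal Hessian element from two evaluations: f(theta + pi e_i) and f(theta)."""
    return 0.5 * (f(shift(theta, i, np.pi)) - f(theta))


theta0 = np.array([0.3, -1.1, 0.7, 2.0])
print([ps_gradient(theta0, i) for i in range(4)])
print([ps_hessian_diag(theta0, i) for i in range(4)])
```

Each parameter enters through exactly one Pauli rotation, which is the setting in which the shift rules are exact; comparing the outputs against finite differences is a simple sanity check.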
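By repeated application of the parameter-shift rule one can also evaluate arbitrary higher-order derivatives as linear combinations of circuit evaluations [35, 36]. Let \(\boldsymbol{\alpha} = (\alpha _{1}, \ldots , \alpha _{M}) \in \mathbb{N}^{M}\) be a multi-index keeping track of the orders of the derivatives, and let \(|\boldsymbol{\alpha}| = \sum_{i=1}^{M} \alpha _{i}\). Then
\[\partial ^{\boldsymbol{\alpha}} f(\boldsymbol{\theta}) = \frac{\partial ^{|\boldsymbol{\alpha}|} f(\boldsymbol{\theta})}{\partial \theta _{1}^{\alpha _{1}} \cdots \partial \theta _{M}^{\alpha _{M}}} = \frac{1}{2^{|\boldsymbol{\alpha}|}} \sum_{m=1}^{2^{|\boldsymbol{\alpha}|}} s_{m} f(\tilde{\boldsymbol{\theta}}_{m}), \tag{B.18}\]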
where \(s_{m} \in \{\pm 1\}\) are signs, and \(\tilde{\boldsymbol{\theta}}_{m}\) are angles obtained by accumulation of shifts along multiple directions.
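Since the output of any circuit evaluation is bounded by the infinity norm (i.e., the largest absolute eigenvalue) of the observable, \(\|O\|_{\infty }= \max \{|o_{i}| : O= \sum_{i} o_{i} |o_{i}\rangle\langle o_{i}|\}\),
\[|f(\boldsymbol{\theta})| \le \|O\|_{\infty } \quad \forall \boldsymbol{\theta},\]
one can bound the sum in Eq. (B.18) simply as
\[\bigl|\partial ^{\boldsymbol{\alpha}} f(\boldsymbol{\theta})\bigr| \le \frac{1}{2^{|\boldsymbol{\alpha}|}} \sum_{m=1}^{2^{|\boldsymbol{\alpha}|}} \bigl|f(\tilde{\boldsymbol{\theta}}_{m})\bigr| \le \|O\|_{\infty }.\]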
B.2 Average value of the Hessian of random PQCs
In this section we derive the formulas (29) and (30) for the expected value of the Hessian shown in the main text. Consider a system of n qubits and a parametrized quantum circuit with unitary \(U(\boldsymbol{\theta}) \in \mathcal{U}(2^{n})\), where \(\mathcal{U}(2^{n})\) is the group of unitary matrices of dimension \(2^{n}\). Given a set of parameter vectors \(\{\boldsymbol{\theta}_{1}, \boldsymbol{\theta}_{2}, \ldots , \boldsymbol{\theta}_{K}\}\), one can construct the corresponding set of unitaries \(\mathbb{U} = \{U_{1}, U_{2}, \ldots , U_{K}\}\), with \(U_{i} = U(\boldsymbol{\theta}_{i})\) and clearly \(\mathbb{U} \subset \mathcal{U}(2^{n})\).
It is now well known that sampling a parametrized quantum circuit from a random assignment of the parameters is approximately equivalent to drawing a random unitary from the Haar distribution, a phenomenon which is at the root of the emergence of barren plateaus (BPs) [54–56]. Specifically, it is numerically observed that parametrized quantum circuits behave like unitary 2-designs, that is, averaging over unitaries \(U_{i}\) sampled from \(\mathbb{U}\) yields the same result as averaging over Haar-random unitaries, up to second-order moments.
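To make the notion of a design concrete, the sketch below checks the first-moment property on a finite set that is known to form an exact unitary 1-design, namely the n-qubit Pauli strings: averaging \(P A P^{\dagger}\) over all of them reproduces the Haar first moment \(\operatorname{Tr}[A]\, \mathbb{1}/2^{n}\). This is an illustration under that assumption only; the function name `pauli_twirl` is hypothetical, and the example does not test whether any particular parametrized circuit is a design.

```python
import itertools

import numpy as np

PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]


def pauli_twirl(A, n):
    """Average P A P^dagger over all 4^n Pauli strings on n qubits."""
    dim = 2 ** n
    out = np.zeros((dim, dim), dtype=complex)
    for combo in itertools.product(PAULIS, repeat=n):
        P = combo[0]
        for p in combo[1:]:
            P = np.kron(P, p)
        out += P @ A @ P.conj().T
    return out / 4 ** n


n = 2
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))  # arbitrary test matrix
twirled = pauli_twirl(A, n)
haar_first_moment = np.trace(A) / 2 ** n * np.eye(2 ** n)
print(np.allclose(twirled, haar_first_moment))
```

For a traceless observable O this immediately gives a vanishing average of \(\operatorname{Tr}[O\, P A P^{\dagger}]\), which is the mechanism used below for the mean of the Hessian.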
As is standard in the literature regarding BPs, in the following we assume that the considered parametrized unitaries (and parts of them) are indeed 2-designs, and so we make use of the following relations for integration over random unitaries [55–58, 78]
B.2.1 Statistics of the Hessian
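Let \(f(\boldsymbol{\theta}) = \operatorname{Tr} [O U(\boldsymbol{\theta})|0\rangle\langle 0|U(\boldsymbol{\theta})^{ \dagger}]\) and assume that the observable O is such that \(\operatorname{Tr} [O] = 0\) and \(\operatorname{Tr} [O^{2}] = 2^{n}\), as is the case when measuring a Pauli string. As shown in Eq. (B.16), diagonal elements of the Hessian matrix H can be calculated as
\[H_{ii} = \frac{\partial ^{2} f(\boldsymbol{\theta})}{\partial \theta _{i}^{2}} = \frac{1}{2} \bigl[ f(\boldsymbol{\theta}+\pi \boldsymbol{e}_{i}) - f(\boldsymbol{\theta}) \bigr]. \tag{B.23}\]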
For simplicity, from now on we drop the explicit dependence on the parameter vector \(\boldsymbol{\theta}\) when it is not needed. The variational parameters enter the quantum circuit via Pauli rotations \(e^{-i\theta _{i} P_{i}/2}\) with \(P_{i} = P_{i}^{\dagger}\) and \(P_{i}^{2}=\mathbb{1}\), and so the shifted unitary \(U(\boldsymbol{\theta}+\pi \boldsymbol{e}_{i})\) can be rewritten as
\[U(\boldsymbol{\theta}+\pi \boldsymbol{e}_{i}) = U_{L}\, e^{-i \pi P_{i}/2}\, U_{R} = -i\, U_{L} P_{i} U_{R}, \tag{B.24}\]
where \(U_{L}\) and \(U_{R}\) form a bipartition of the circuit at the position of the shifted angle, so that \(U(\boldsymbol{\theta}) = U_{L}U_{R}\).
Assuming that the set of unitaries \(\mathbb{U}_{L}\) generated by \(U_{L}\) is at least a 1-design, one has that
where in the first line we exchanged the trace and the expectation value since both are linear operations, and in the second line we made use of Eq. (B.21) for the first moment of the Haar distribution. Similarly, one can show that if \(\mathbb{U}_{R}\) forms a 1-design, then averaging over it yields the same result, namely \(\mathbb{E}_{U_{R}}[f(\boldsymbol{\theta}+\pi \boldsymbol{e}_{i})] = 0\). The same calculation for \(f(\boldsymbol{\theta})\) shows that \(\mathbb{E}_{U_{R}}[f(\boldsymbol{\theta})]=\mathbb{E}_{U_{L}}[f(\boldsymbol{\theta})] = 0\).
Thus, for every diagonal element of the Hessian, if either \(\mathbb{U}_{L}\) or \(\mathbb{U}_{R}\) is a 1-design (that is, Eq. (B.21) holds), then its expectation value vanishes:
\[\mathbb{E}[H_{ii}] = 0. \tag{B.28}\]
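The variance of the diagonal elements can be calculated in a similar manner, even though the calculation is more involved. Substituting Eq. (B.23) in the definition of the variance, one obtains
\[\operatorname{Var}[H_{ii}] = \mathbb{E}\bigl[H_{ii}^{2}\bigr] - \mathbb{E}[H_{ii}]^{2} = \frac{1}{4} \bigl( \mathbb{E}\bigl[f(\boldsymbol{\theta}+\pi \boldsymbol{e}_{i})^{2}\bigr] + \mathbb{E}\bigl[f(\boldsymbol{\theta})^{2}\bigr] - 2\, \mathbb{E}\bigl[f(\boldsymbol{\theta}+\pi \boldsymbol{e}_{i}) f(\boldsymbol{\theta})\bigr] \bigr). \tag{B.29}\]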
In order to use Eq. (B.22) for second moment integrals, we can rewrite these expectation values as follows
and similarly for the remaining two terms. Assuming that the set of unitaries \(\mathbb{U}_{L}\) generated by \(U_{L}\) is a 2-design, then
where in the second line we made use of Eq. (B.22), and in the third line we used that \(\operatorname{Tr} [B]=\operatorname{Tr} [B^{2}]=1\) since \(B = P_{i} U_{R}|0\rangle\langle 0|U_{R}^{\dagger }P_{i}\) is a projector, and that \(\operatorname{Tr} [O]=0\) and \(\operatorname{Tr} [O^{2}]=2^{n}\). Similarly, one can show that integration over \(\mathbb{U}_{R}\) yields the same result. Also, the same calculation leads to \(\mathbb{E}_{U_{L}}[f(\boldsymbol{\theta})^{2}] = \mathbb{E}_{U_{R}}[f( \boldsymbol{\theta})^{2}] = 1/(2^{n}+1)\). Thus, if either \(\mathbb{U}_{L}\) or \(\mathbb{U}_{R}\) is a 2-design, then
\[\mathbb{E}\bigl[f(\boldsymbol{\theta}+\pi \boldsymbol{e}_{i})^{2}\bigr] = \mathbb{E}\bigl[f(\boldsymbol{\theta})^{2}\bigr] = \frac{1}{2^{n}+1}. \tag{B.34}\]
Now we evaluate the correlation term \(\mathbb{E}[f(\boldsymbol{\theta}+\pi \boldsymbol{e}_{i})f(\boldsymbol{\theta})]\). If \(\mathbb{U}_{L}\) is a 2-design, then
If instead \(\mathbb{U}_{R}\) is a 2-design, it holds that
If both of them are 2-designs, then continuing from Eq. (B.36), one obtains
Finally, plugging Eqs. (B.35), (B.36) and (B.37) in Eq. (B.29), one has \(\forall i=1,\ldots , M\)
where \(\mathbb{U}_{R} = \mathbb{U}_{R}^{(i)}\) and \(\mathbb{U}_{L} = \mathbb{U}_{L}^{(i)}\) are defined as in Eq. (B.24) and actually depend on the index i of the parameter.
Not surprisingly, as happens for first-order derivatives, second-order derivatives of PQCs are also found to be exponentially vanishing [36, 56]: from Eq. (B.38) one can check that \(\operatorname{Var}[H_{ii}] \in \mathcal{O}(2^{-n})\).
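The Haar moments used in this derivation can be probed by direct sampling. The sketch below is an illustration under stated assumptions: it draws Haar-random states as normalized complex Gaussian vectors (which has the same distribution as \(U|0\rangle\) with Haar-random U) and uses the Pauli string \(Z^{\otimes n}\) as observable, so that \(\operatorname{Tr}[O]=0\) and \(\operatorname{Tr}[O^{2}]=2^{n}\). The function names and sample sizes are arbitrary choices, not taken from the paper.

```python
import numpy as np


def z_string_diag(n):
    """Diagonal of the Pauli string Z (x) ... (x) Z on n qubits."""
    diag = np.array([1.0])
    for _ in range(n):
        diag = np.kron(diag, [1.0, -1.0])
    return diag


def haar_states(num_samples, dim, rng):
    """Haar-random pure states as normalized complex Gaussian vectors."""
    v = rng.normal(size=(num_samples, dim)) + 1j * rng.normal(size=(num_samples, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def sample_f(n, num_samples=20000, seed=0):
    """Samples of f = <psi|O|psi> with O = Z^{(x) n} over Haar-random |psi>."""
    rng = np.random.default_rng(seed)
    psi = haar_states(num_samples, 2 ** n, rng)
    return (np.abs(psi) ** 2) @ z_string_diag(n)


n = 3
values = sample_f(n)
print("sample mean of f:   ", values.mean())
print("sample mean of f^2: ", np.mean(values ** 2))
print("1 / (2^n + 1):      ", 1 / (2 ** n + 1))
```

Up to statistical fluctuations, the sample mean of f should be close to zero and the sample mean of \(f^{2}\) close to \(1/(2^{n}+1)\), in line with the first- and second-moment results above.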
B.2.2 Statistics of the trace of the Hessian
The average value of the trace of the Hessian is easily found to be zero using Eq. (B.28); in fact,
\[\mathbb{E}\bigl[\operatorname{Tr} [H]\bigr] = \sum_{i=1}^{M} \mathbb{E}[H_{ii}] = 0,\]
where we assume that for every parameter i either \(\mathbb{U}_{R}^{(i)}\) or \(\mathbb{U}_{L}^{(i)}\) is a 1-design. The variance of the trace is instead
We can upper bound this quantity using the covariance inequality [79]
where we assumed that \(\operatorname{Var}[H_{ii}] \approx \operatorname{Var}[H_{jj}]\) \(\forall i,j\). Using that \(\operatorname{Var}[H_{ii}] \in \mathcal{O}(2^{-n})\), one finally has
Alternatively, one can obtain a tighter yet qualitative approximation by explicitly considering the nature of the sums in Eq. (B.40). First, by using Eq. (B.23), the covariance term is explicitly
where for ease of notation we defined \(f_{i,j} = f(\boldsymbol{\theta}+\pi \boldsymbol{e}_{i,j})\) and \(f=f(\boldsymbol{\theta})\). Note that, except for the first term, which is always positive, all remaining correlation terms can be either positive or negative. Also, all of these terms are bounded from above by the same quantity, since by the Cauchy-Schwarz inequality it follows that
where we have used \(\mathbb{E}[f^{2}]=\mathbb{E}[f_{i}^{2}]=1/(2^{n}+1)\) from Eq. (B.34). Then, the variance can be written as
Numerical simulations
In addition to Fig. 6 in the main text, in Fig. 16 we report numerical evidence for the trace of the Hessian for two common hardware-efficient parametrized quantum circuit ansatzes. The histograms represent the frequency of obtaining a given value of the trace of the Hessian \(\operatorname{Tr} [H(\boldsymbol{\theta})]\) upon random assignments of the parameters. The lengths of the arrows are, respectively: “Numerical 2σ” (black solid line), twice the statistical standard deviation computed from the numerical results; “Approximation” (dashed red), twice the square root of Eq. (B.44) with \(\Delta =0\); “Bound” (dash-dotted green), twice the square root of the upper bound in Eq. (B.41).
All simulations confirm the bound (B.41) and, more interestingly, both the circuit on the left of Fig. 16 and the one in Fig. 6 in the main text have a numerical variance which is very well approximated by Eq. (B.44) with \(\Delta = 0\). We conjecture this is due to the fact that all correlation terms in Eq. (B.44) are roughly of the same order of magnitude (see Eq. (B.43)), and can be either positive or negative, depending on the parameter and the specifics of the ansatz. Thus, one can expect the whole contribution to either vanish, \(\Delta \approx 0\), or be negligible with respect to the leading term. If this is the case, then substituting \(\mathbb{E}[f^{2}] = 1/(2^{n}+1)\), the variance of the trace of the Hessian is approximately
which is four times smaller than the upper bound in Eq. (B.41), but clearly has the same scaling. While we verified this numerically also for other numbers of qubits, more investigations are needed to understand if and when this approximation holds, and we leave a detailed study of this phenomenon for future work.
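A numerical experiment of this kind can be sketched in a few lines. The example below is not the simulation behind Fig. 16: it assumes a hypothetical hardware-efficient ansatz made of layers of RY rotations followed by a chain of CZ gates, the observable \(Z^{\otimes n}\), and uniformly random angles. It estimates \(\operatorname{Tr}[H]\) from \(M+1\) circuit evaluations via Eq. (B.16) and collects samples whose mean and standard deviation can then be compared with the expressions above.

```python
import numpy as np


def apply_ry(state, theta, qubit, n):
    """Apply exp(-i theta Y / 2) to one qubit of an n-qubit statevector."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    gate = np.array([[c, -s], [s, c]])
    psi = state.reshape([2] * n)
    psi = np.moveaxis(np.tensordot(gate, psi, axes=([1], [qubit])), 0, qubit)
    return psi.reshape(-1)


def apply_cz(state, q1, q2, n):
    """Apply a CZ gate: flip the sign of amplitudes with both qubits in |1>."""
    psi = state.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[q1], idx[q2] = 1, 1
    psi[tuple(idx)] *= -1
    return psi.reshape(-1)


def expectation(theta, n, layers):
    """f(theta) = <Z...Z> for the assumed RY + CZ-chain ansatz."""
    angles = np.asarray(theta, dtype=float).reshape(layers, n)
    state = np.zeros(2 ** n)
    state[0] = 1.0
    for layer in range(layers):
        for q in range(n):
            state = apply_ry(state, angles[layer, q], q, n)
        for q in range(n - 1):
            state = apply_cz(state, q, q + 1, n)
    diag = np.array([1.0])
    for _ in range(n):
        diag = np.kron(diag, [1.0, -1.0])
    return float(np.sum(diag * np.abs(state) ** 2))


def hessian_trace(theta, n, layers):
    """Tr[H] = sum_i [f(theta + pi e_i) - f(theta)] / 2, using M + 1 evaluations."""
    theta = np.asarray(theta, dtype=float)
    base = expectation(theta, n, layers)
    total = 0.0
    for i in range(theta.size):
        shifted = theta.copy()
        shifted[i] += np.pi
        total += 0.5 * (expectation(shifted, n, layers) - base)
    return total


def sample_traces(n=3, layers=3, num_samples=200, seed=1):
    """Tr[H] for uniformly random parameter assignments."""
    rng = np.random.default_rng(seed)
    m = n * layers
    return np.array(
        [hessian_trace(rng.uniform(0, 2 * np.pi, m), n, layers) for _ in range(num_samples)]
    )


traces = sample_traces()
print("mean of Tr[H]:", traces.mean(), " std of Tr[H]:", traces.std())
```

Whether a shallow circuit like this one is close to a 2-design is not guaranteed, so agreement with the Haar-based predictions should be checked rather than assumed.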
Appendix C: Visualization of CartPole policies obtained with Q-learning
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Skolik, A., Mangini, S., Bäck, T. et al. Robustness of quantum reinforcement learning under hardware errors. EPJ Quantum Technol. 10, 8 (2023). https://doi.org/10.1140/epjqt/s40507-023-00166-1
Keywords
 Variational quantum algorithms
 Quantum machine learning
 Quantum hardware noise