ResQNets: A Residual Approach for Mitigating Barren Plateaus in Quantum Neural Networks

The barren plateau problem in quantum neural networks (QNNs) is a significant challenge that hinders the practical success of QNNs. In this paper, we introduce residual quantum neural networks (ResQNets) as a solution to address this problem. ResQNets are inspired by classical residual neural networks and involve splitting the conventional QNN architecture into multiple quantum nodes, each containing its own parameterized quantum circuit, and introducing residual connections between these nodes. Our study demonstrates the efficacy of ResQNets by comparing their performance with that of conventional QNNs and plain quantum neural networks (PlainQNets) through multiple training experiments and analyzing the cost function landscapes. Our results show that the incorporation of residual connections results in improved training performance. Therefore, we conclude that ResQNets offer a promising solution to overcome the barren plateau problem in QNNs and provide a potential direction for future research in the field of quantum machine learning.


Introduction
Noisy Intermediate-Scale Quantum (NISQ) devices are a new generation of quantum computers capable of executing quantum algorithms. However, NISQ devices still suffer from significant errors and are limited in both the number of qubits and coherence time [1]. Despite these limitations, NISQ devices are an important stepping stone towards the development of fault-tolerant quantum computers, as they provide a platform for exploring and evaluating basic quantum algorithms and applications [2,3]. Research in the NISQ era is focused on developing algorithms and techniques that are resilient to noise and errors and can run effectively on NISQ devices [4]. This includes algorithms for quantum error correction [5], quantum optimization [6], and quantum machine learning (QML) [7].
QML is an interdisciplinary field that combines concepts and techniques from quantum computing and machine learning (ML). It aims to leverage the unique properties of quantum systems, such as superposition, entanglement, and interference, to develop new algorithms and approaches for solving complex machine learning problems [8,9].
QNNs are a promising area of research that aims to combine the power of quantum computing and neural networks to solve complex computational problems [34,35]. Unlike classical neural networks, QNNs use quantum-inspired representations and operations to encode and process data [36][37][38]. This allows for the exploration of an exponentially large solution space and the exploitation of quantum parallelism, potentially leading to faster and more accurate results [8,14,39,40]. QNNs can be considered a subclass of variational quantum algorithms, which aim to optimize the parameters $\theta$ of a parameterized quantum circuit (PQC) $U(\theta)$ to minimize a cost function $C$. A PQC exposes tunable parameters that are optimized through classical computation. An example of a QNN architecture is the quantum Boltzmann machine [41,42], which uses quantum circuits to model complex probability distributions and perform unsupervised learning tasks. In addition to unsupervised learning, QNNs have shown potential in various applications such as quantum feature detection [20], quantum data compression and denoising [43,44], and quantum reinforcement learning [45,46]. QNNs can also be used for quantum-enhanced image recognition [7,47] and quantum molecular simulations [48].
However, despite their potential, QNNs are still in the early stages of development and face several technical and practical challenges. In particular, training and optimizing the parameters in QNNs pose significant challenges. To address these challenges, the research community has been developing quantum landscape theory [49], which explores the properties of cost function landscapes in QML systems. Consequently, interesting results have been obtained in the study of QNN training landscapes, including the occurrence of barren plateaus (BP) [50], the presence of sub-optimal local minima [51], and the impact of noise on cost function landscapes [52][53][54][55]. These findings provide important insights into the properties of QNNs and their training dynamics, and can inform the development of new algorithms and strategies for training and optimizing QNNs.
In particular, the BP problem refers to a phenomenon in which the gradients of the cost function vanish as the number of qubits in the circuit increases [50], so that the circuit's ability to learn, for example to approximate a target unitary operation, becomes severely limited. The phenomenon of BP in QNNs is a significant challenge that impedes the advancement and widespread implementation of QNNs. To mitigate BP, various strategies have been proposed, including clever parameter initialization techniques [56], pre-training [57], examination of the dependence on the cost function [58,59], layer-wise training of QNNs [60], and initialization parameters drawn from the beta distribution [61]. A trainability versus expressibility analysis of QNNs from the perspective of BP is conducted in [62], where a trade-off between quantum layer width and depth is observed for better learning performance. These solutions aim to overcome the limitations posed by BP in QNNs and facilitate the full realization of their potential. However, it is important to note that the solution that works best for one QNN architecture may not work for another, as the BP problem can be highly dependent on the specific problem being solved and the quantum architecture being used.

Related work
In recent research efforts, the concept of utilizing the residual approach in QNNs has gained traction. One such work, proposed in [63], introduces a hybrid quantum-classical neural network with deep residual learning. The authors explore the integration of the residual block structure into the QNN architecture and highlight its potential benefits. Specifically, they emphasize that connecting residual blocks with QNNs can enhance robustness against noise, a crucial consideration in quantum computing applications.
In contrast, our research focuses on a different aspect of the residual approach in QNNs. We aim to transform the traditional QNN architecture into a residual structure by dividing the conventional QNN architecture into multiple quantum nodes (each containing its own PQC), and investigate its effectiveness in mitigating the BP phenomenon. BP are a challenge that arises as the number of qubits in a quantum circuit increases, leading to deep quantum circuits. Our primary objective is to address this issue and improve the training performance of QNNs.
Furthermore, a related study in [64] employs the residual approach in the optimization of an IoT platform. The authors also conclude that the residual approach in QNNs exhibits greater robustness against noisy data and better performance in learning unitary functions.
In a different context, [65] explores the residual approach in QNNs but with a focus on shallower quantum circuits rather than deep ones. The authors aim to achieve comparable performance with shallower circuits, and they suggest manipulating data encoding strategies to improve accuracy. Our work, however, concentrates entirely on quantum circuit width, i.e., the number of qubits, as a means to study and address the barren plateau phenomenon, independent of the data encoding technique.
Overall, these research efforts collectively contribute to the growing body of knowledge on the residual approach in QNNs, highlighting its potential benefits for various quantum computing applications.

Contribution
In this paper, we propose a novel solution to mitigate the issue of barren plateaus (BP) in quantum neural networks (QNNs). Our approach is based on the concept of residual neural networks, which were previously introduced as a means of overcoming the problem of vanishing gradients in classical neural networks. In this context, we introduce the concept of residual quantum neural networks (ResQNets) by incorporating residual connections between two quantum layers of variable depths. Our findings suggest that the utilization of ResQNets substantially enhances the training process of QNNs as compared to their non-residual counterparts, denoted as PlainQNets. To substantiate the efficacy of our proposed ResQNets, we undertake a systematic comparison with PlainQNets, involving an analysis of the cost function landscapes and an assessment of training performance. The results of our experiments show that incorporating residual connections in QNNs (ResQNets) effectively mitigates the adverse effects of BP and improves overall training performance.
Organization. The rest of the paper is organized as follows: Sect. 2 provides an overview of classical and quantum residual neural networks and motivates their application. Section 3 discusses parameterized quantum circuits and elaborates on how multiple PQCs can be cascaded. This section also introduces the residual approach in cascaded PQCs. The methodology we adopt while conducting the various experiments is provided in Sect. 4. Section 5 presents the results we obtained in both the simulation environment and on real quantum devices. Finally, the paper concludes in Sect. 6 with a few concluding remarks and pointers to possible extensions of this work.

Residual neural networks
Residual Neural Networks (ResNets) are a type of deep neural network architecture that aims to improve the training process by addressing the problem of vanishing gradients. The basic idea behind ResNets is to introduce residual connections between layers in the network, allowing easier optimization as the network gets deeper. The residual connections allow the network to learn a residual mapping rather than trying to fit the target function directly. This helps prevent the vanishing gradient problem, where the gradients in the backpropagation process become very small, making it difficult to update the parameters effectively. ResNets were first introduced in [66], where the authors showed that ResNets outperformed traditional deep neural networks on benchmark image recognition tasks and demonstrated that ResNets could accommodate significantly deeper architectures than previous networks without sacrificing accuracy.
Residual connections in ResNets have been shown to be effective in training very deep neural networks with hundreds or even thousands of layers. This has drastically improved performance in several computer vision and natural language processing tasks. A typical structure of a residual block is depicted in Fig. 1a. Given an input feature map $x$, the basic building block of a ResNet can be defined as:
$$H(x) = F(x, \{W_i\}) + x,$$
where $H(x)$ is the output of the block, $F$ is a non-linear function represented by a series of neuron and activation layers with parameters $W_i$, and $x$ is the input feature map that is added back to the output (the residual connection). The model is trained to learn the function $F$ such that it approximates the residual mapping $y - x$, where $y$ is the desired output. By introducing residual connections, ResNets can address the vanishing gradient problem in deep neural networks, allowing deeper architectures to be trained effectively.
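As a minimal sketch of the residual block just described, the NumPy snippet below implements $H(x) = F(x, \{W_i\}) + x$ with a two-layer $F$ and a ReLU activation. The layer sizes, the choice of ReLU, and the names `residual_block`, `W1`, and `W2` are illustrative assumptions, not details taken from [66].

```python
import numpy as np

def residual_block(x, W1, W2):
    """Return H(x) = F(x, {W1, W2}) + x for a two-layer residual branch F."""
    hidden = np.maximum(0.0, W1 @ x)  # first linear layer followed by ReLU
    F = W2 @ hidden                   # second linear layer: the residual branch
    return F + x                      # identity shortcut adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(4, 4))
W2 = np.zeros((4, 4))                 # with F == 0 the block is exactly the identity
H = residual_block(x, W1, W2)
```

Because the shortcut contributes an identity term to the Jacobian of $H$ with respect to $x$, the gradient reaching earlier layers does not depend solely on the residual branch, which is the usual intuition for why such blocks ease the training of deep networks.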
In this paper, we introduce the quantum counterpart of ResNets, namely the residual quantum neural network (ResQNet), a QNN architecture combining the principles of classical ResNets with QNNs. The basic idea is to add a residual connection between the output of one layer of quantum operations and the input of the next layer. This helps to mitigate the quantum analogue of the vanishing gradient problem, namely BP, which is a major challenge in QNNs and arises as the number of qubits in the system increases. Figure 1b illustrates how ResQNets compare to ResNets.
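In ResQNets, the residual connection is mathematically represented as:
$$|\psi_{\text{out}}(\theta)\rangle = U(\theta)\,|\psi(\theta)\rangle + |\psi(\theta)\rangle,$$
where $|\psi(\theta)\rangle$ is the input to the quantum circuit, $U(\theta)$ is the unitary operation defined by the PQC, and $|\psi_{\text{out}}(\theta)\rangle$ is the output.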

Parameterized quantum circuits
A QNN is a type of Parameterized Quantum Circuit (PQC), which is a quantum circuit with tunable parameters that can be optimized to perform specific tasks. In a QNN, the parameters are typically optimized using classical optimization algorithms to learn a target function or perform a specific task. The PQC architecture of a QNN allows for the representation and manipulation of quantum data in a manner that can be used for various applications, such as QML and quantum control. The mathematical derivation of a PQC involves the representation of quantum states and gates as matrices and the composition of these matrices to form the overall unitary operator for the circuit.
A quantum state can be represented by a column vector $|\psi\rangle$ in a Hilbert space, whose elements $\alpha_i$ are complex numbers that satisfy the normalization constraint:
$$\langle\psi|\psi\rangle = \sum_i |\alpha_i|^2 = 1.$$
A quantum gate is represented by a unitary matrix, which preserves the norm of the vector, i.e., the inner product of the transformed vector with itself is equal to the inner product of the original vector with itself:
$$U^\dagger U = U U^\dagger = I,$$
where $U^\dagger$ is the conjugate transpose of $U$ and $I$ is the identity matrix. A PQC can be modeled as a sequence of gates, each represented by a unitary matrix that depends on classical parameters. The overall unitary operator of the circuit can be obtained by composing the matrices of the individual gates in the correct order:
$$U(\theta) = U_n(\theta_n) \cdots U_2(\theta_2)\, U_1(\theta_1),$$
where $U_i(\theta_i)$ is the unitary matrix representing the $i$-th gate and $\theta_i$ is a classical parameter.
The final quantum state after applying the PQC to an initial state can be obtained by matrix-vector multiplication:
$$|\psi_{\text{final}}\rangle = U(\theta)\,|\psi_{\text{initial}}\rangle.$$
The parameters $\theta_1, \ldots, \theta_n$ can be optimized using classical optimization algorithms to achieve a desired quantum state or to maximize an objective function $f$ such as the expected value of a measurement outcome. The optimization problem can be written as:
$$\theta^* = \arg\max_{\theta_1, \ldots, \theta_n} f(\theta_1, \ldots, \theta_n).$$
Solving this optimization problem provides the optimal set of parameters $\theta^*$ that produce the desired outcome.
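The composition rule above can be illustrated on a single qubit. The sketch below uses the standard RX and RY rotation matrices, alternates them purely for illustration (the gate ordering and the helper name `pqc_unitary` are assumptions), and checks that the composed operator is unitary and preserves the norm of the state.

```python
import numpy as np

def rx(theta):
    """Standard single-qubit rotation about the X axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def ry(theta):
    """Standard single-qubit rotation about the Y axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def pqc_unitary(thetas):
    """Compose U = U_n ... U_2 U_1; even positions use RX, odd positions RY."""
    U = np.eye(2, dtype=complex)
    for i, t in enumerate(thetas):
        gate = rx(t) if i % 2 == 0 else ry(t)
        U = gate @ U                  # later gates multiply from the left
    return U

thetas = np.array([0.3, 1.1, 2.0])
U = pqc_unitary(thetas)
psi_initial = np.array([1.0, 0.0], dtype=complex)   # the state |0>
psi_final = U @ psi_initial                         # matrix-vector multiplication
```

Multiplying from the left inside the loop is what makes the first gate in `thetas` act first on the state, matching the right-to-left ordering in the product formula.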

Cascading PQCs
In the proposed ResQNets, we encapsulate PQCs/QNNs into quantum nodes (QNs) and arrange multiple QNs in series, so that the output of one QN serves as the input of the next. This structure enables us to introduce the residual learning approach in a manner that allows the PQCs to work together to achieve the desired outcome. The process of cascading PQCs involves feeding the output of each PQC into the input of the next, creating a layered structure where each layer represents a single PQC. In this case, each PQC can build on the outputs of the previous ones, leading to a more complex and sophisticated computation. The residual learning approach is used to ensure that the overall computation remains stable, where the output of each PQC is combined with the input of the next in a specified manner.
We now present the mathematical formulation for connecting multiple PQCs in sequence. We refer to each PQC as $U_i$, where $i$ denotes the QN it is encapsulated in.

2-cascaded PQC
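Consider two PQCs denoted as $U_1(\theta_1)$ and $U_2(\theta_2)$, where $\theta_1$ and $\theta_2$ are classical parameters. The first PQC $U_1(\theta_1)$ is applied to an initial quantum state $|\psi_{\text{initial}}\rangle$ to obtain an intermediate quantum state $|\psi_{\text{intermediate}}\rangle$:
$$|\psi_{\text{intermediate}}\rangle = U_1(\theta_1)\,|\psi_{\text{initial}}\rangle.$$
The second PQC $U_2(\theta_2)$ is then applied to the intermediate state to obtain the final quantum state $|\psi_{\text{final}}\rangle$:
$$|\psi_{\text{final}}\rangle = U_2(\theta_2)\,|\psi_{\text{intermediate}}\rangle.$$
The overall unitary operator of the two cascaded PQCs can be obtained by composing the matrices of the individual PQCs in the correct order:
$$U(\theta_1, \theta_2) = U_2(\theta_2)\, U_1(\theta_1).$$
The final quantum state after applying the two cascaded PQCs to an initial state can be obtained by matrix-vector multiplication:
$$|\psi_{\text{final}}\rangle = U(\theta_1, \theta_2)\,|\psi_{\text{initial}}\rangle.$$
The parameters $\theta_1$ and $\theta_2$ can be optimized using classical optimization algorithms to achieve a desired quantum state or to maximize an objective function $f$ such as the expected value of a measurement outcome. The optimization problem can be written as:
$$(\theta_1^*, \theta_2^*) = \arg\max_{\theta_1, \theta_2} f(\theta_1, \theta_2).$$
Solving this optimization problem returns the optimal set of parameters $(\theta_1^*, \theta_2^*)$ that produce the desired outcome.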

n-cascaded PQCs
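Similarly, for $n$ cascaded PQCs, where each PQC takes the output of the previous one as its input, the intermediate states can be described as follows:
$$|\psi_{\text{intermediate},i}\rangle = U_i(\theta_i)\,|\psi_{\text{intermediate},i-1}\rangle,$$
where $i = 1, 2, \ldots, n$ and $|\psi_{\text{intermediate},0}\rangle = |\psi_{\text{initial}}\rangle$. The overall unitary operator of the $n$ cascaded PQCs can be obtained by composing the matrices of the individual PQCs in the correct order:
$$U(\theta_1, \ldots, \theta_n) = U_n(\theta_n) \cdots U_2(\theta_2)\, U_1(\theta_1).$$
The final quantum state after applying the $n$ cascaded PQCs to an initial state can be obtained by matrix-vector multiplication:
$$|\psi_{\text{final}}\rangle = U(\theta_1, \ldots, \theta_n)\,|\psi_{\text{initial}}\rangle.$$
The parameters $\theta_1, \theta_2, \ldots, \theta_n$ can be optimized using classical optimization algorithms to achieve a desired quantum state or to maximize an objective function $f$ such as the expected value of a measurement outcome. The optimization problem can be written as:
$$(\theta_1^*, \ldots, \theta_n^*) = \arg\max_{\theta_1, \ldots, \theta_n} f(\theta_1, \ldots, \theta_n).$$
Solving this optimization problem returns the optimal set of parameters $(\theta_1^*, \theta_2^*, \ldots, \theta_n^*)$ that produce the desired outcome.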
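The equivalence between applying the PQCs one after another and applying their composed unitary once can be checked numerically. The sketch below is a toy single-qubit case in which every PQC is one RY rotation (an assumption made only to keep the example small); the helper name `cascade` is hypothetical.

```python
from functools import reduce
import numpy as np

def ry(theta):
    """Standard single-qubit rotation about the Y axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def cascade(unitaries):
    """Return U_n ... U_2 U_1 for the list [U_1, ..., U_n]."""
    dim = unitaries[0].shape[0]
    return reduce(lambda acc, U: U @ acc, unitaries, np.eye(dim, dtype=complex))

thetas = [0.4, 1.3, 2.2]
nodes = [ry(t) for t in thetas]        # one toy PQC per quantum node

psi_initial = np.array([1.0, 0.0], dtype=complex)
state = psi_initial
for U in nodes:                        # sequential application, node by node
    state = U @ state

U_total = cascade(nodes)               # single composed operator
```

In this toy case the cascade collapses to one rotation by the summed angle, which shows that cascading without an intermediate measurement is equivalent to a single deeper PQC.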

Residual PQCs
We now introduce residual blocks in the cascaded PQCs encapsulated in QNs, which we call ResQNets. In ResQNets, the output of the previous PQC is added to its input and fed as the input to the next PQC. The residual block is inserted to facilitate efficient information flow and improved performance. The primary objective of incorporating residual blocks in QNNs here is to overcome the difficulties associated with BP and thereby improve the learning process. Furthermore, the proposed method aims to harness the strengths of both residual learning and quantum computing to tackle complex problems more effectively.
To mathematically formulate our proposed ResQNets, we start by considering the case of two PQCs and then extend the approach to the general case of cascading $n$ PQCs with $n$ residual blocks. We refer to each PQC as $U_i$, where $i$ denotes the QN it is encapsulated in.

1-residual block
A ResQNet with a single residual block contains a maximum of two PQCs of arbitrary depth enclosed in two separate QNs. The first QN serves as a residual block whose input is added to its output before passing it as input to the PQC in the next QN. In the context of the NISQ era, hybrid QNNs have gained considerable traction. These models exhibit a distinctive architecture in which the input data, being classical, requires an initial encoding step. This encoding plays a vital role in preparing the data for processing on a quantum computer. It is important to note that in this paper we exclusively employ a configuration in which classical datasets are not used. Instead, the qubits are initialized in the ground state and the gates in the PQC are randomly parameterized prior to the training phase. Nevertheless, here we present a comprehensive mathematical framework that accommodates the broader context of hybrid systems. This formalism is especially pertinent in scenarios where classical datasets form an integral component of the computational process. An illustrative configuration featuring a pair of quantum nodes (QNs), where the first node functions as the residual block, is depicted in Fig. 2.

Figure 2 Illustration of residual approach in hybrid quantum neural networks
The two QNs contain two PQCs denoted as $U_1(\theta_1)$ and $U_2(\theta_2)$, where $\theta_1$ and $\theta_2$ are classical parameters encoded in such a way that the quantum circuit can process them. The classical dataset is a set of data points, i.e., $D = \{x^{(i)}\}$, where $x^{(i)}$ represents the $i$th data point. The next step is to encode the classical data: each data point $x^{(i)}$ is encoded using an encoding method (e.g., angle or amplitude encoding [67]). The angle-encoded data for the $i$th data point can be denoted as $\theta^{(i)}$. Each encoded data point $\theta^{(i)}$ is used as parameters for the PQC in the first quantum node (QN1). This PQC processes the encoded data and, upon measurement, generates a classical result $y_1^{(i)}$. The classical result $y_1^{(i)}$ from QN1 is added element-wise to the original data $x^{(i)}$ to obtain a modified classical dataset denoted by $D' = \{x^{(i)} + y_1^{(i)}\}$. Each data point in $D'$ is then encoded again to obtain a new set of encoded data, denoted by $\theta'^{(i)}$, which is used as input for the PQC in the second quantum node (QN2). The PQC in the second QN processes this encoded data and, upon measurement, generates the classical result $y_2^{(i)}$, which, in the case of two QNs, is the final output of the network used for cost function optimization.
The mathematical formulation for a single residual block in the two-QN setting starts with preparing and initializing the qubits. We initialize the qubits in the ground state:
$$|\psi^{(1)}_{\text{initial}}\rangle = |0\rangle^{\otimes n},$$
where $n$ is the number of qubits and the superscript denotes the PQC number, i.e., (1) here denotes the qubit initialization in the first PQC. After the qubit initialization, the next step is to encode the classical data features into quantum space:
$$\theta_1 = f_{\text{encode}}(x^{(i)}),$$
where $f_{\text{encode}}$ is the encoding function, which maps the classical input features to quantum space. The first PQC $U_1(\theta_1)$ is applied to the encoded input features to obtain an intermediate quantum state $|\psi_{\text{intermediate}}\rangle$:
$$|\psi_{\text{intermediate}}\rangle = U_1(\theta_1)\,|\psi^{(1)}_{\text{initial}}\rangle.$$
Upon measurement, the intermediate quantum state $|\psi_{\text{intermediate}}\rangle$ collapses and returns the classical result:
$$y_1 = M(|\psi_{\text{intermediate}}\rangle),$$
where $M$ denotes the qubit measurement. The qubits in the second PQC are also prepared in the ground state:
$$|\psi^{(2)}_{\text{initial}}\rangle = |0\rangle^{\otimes n},$$
where the superscript (2) denotes the PQC number. Now, the input of the second PQC $U_2(\theta_2)$ is not just the output of QN1 but the sum of the original input ($x^{(i)}$) and the intermediate result ($y_1$).
We again have to encode the new data points obtained by adding the original input and the intermediate result before passing them to the PQC in QN2:
$$\theta_2 = f_{\text{encode}}(x^{(i)} + y_1).$$
The final quantum state obtained after the second QN would be:
$$|\psi_{\text{final}}\rangle = U_2(\theta_2)\,|\psi^{(2)}_{\text{initial}}\rangle.$$
After measuring the qubits in the second QN, the final quantum state collapses and we get the final classical result:
$$y_2 = M(|\psi_{\text{final}}\rangle).$$
The parameters $\theta$ can be optimized using classical optimization algorithms to achieve a desired quantum state or to maximize an objective function such as the expected value of a measurement outcome.
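The hybrid data flow for one residual block can be sketched as a small two-qubit statevector simulation. Several choices here are assumptions made for illustration and are not specified in the text: each node is split into an angle-encoding step (RY rotations) followed by a separate trainable layer with weights `w`, the measurement $M$ is modeled as the vector of Pauli-Z expectation values, and all function names are hypothetical.

```python
from functools import reduce
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
P0 = np.diag([1, 0]).astype(complex)   # projector |0><0|
P1 = np.diag([0, 1]).astype(complex)   # projector |1><1|

def rx(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def kron_all(mats):
    return reduce(np.kron, mats)

def cnot(control, target, n):
    """CNOT on n qubits (qubit 0 is the leftmost tensor factor)."""
    keep = kron_all([P0 if k == control else I2 for k in range(n)])
    flip = kron_all([P1 if k == control else X if k == target else I2
                     for k in range(n)])
    return keep + flip

def pqc(weights):
    """One layer: RX then RY on every qubit, then CNOTs between neighbours."""
    n = len(weights)
    U = kron_all([ry(b) @ rx(a) for a, b in weights])
    for q in range(n - 1):
        U = cnot(q, q + 1, n) @ U
    return U

def encode(x):
    """Assumed angle encoding: RY(x_q) on each qubit of the ground state."""
    ground = np.zeros(2 ** len(x), dtype=complex)
    ground[0] = 1.0
    return kron_all([ry(xq) for xq in x]) @ ground

def measure(psi, n):
    """Model M as the per-qubit expectation values <Z_q> (an assumption)."""
    probs = (np.abs(psi) ** 2).reshape([2] * n)
    z = np.array([1.0, -1.0])
    return np.array([probs.sum(axis=tuple(k for k in range(n) if k != q)) @ z
                     for q in range(n)])

def quantum_node(x, weights):
    return measure(pqc(weights) @ encode(x), len(x))

def plainqnet(x, w1, w2):
    return quantum_node(quantum_node(x, w1), w2)   # output of QN1 feeds QN2

def resqnet(x, w1, w2):
    y1 = quantum_node(x, w1)
    return quantum_node(x + y1, w2)                # residual: x added to y1

rng = np.random.default_rng(1)
x = np.array([0.5, -0.2])
w1, w2 = rng.uniform(0, np.pi, (2, 2)), rng.uniform(0, np.pi, (2, 2))
y_plain, y_res = plainqnet(x, w1, w2), resqnet(x, w1, w2)
```

The only difference between `plainqnet` and `resqnet` is the element-wise addition before the second encoding, which mirrors the step $\theta_2 = f_{\text{encode}}(x^{(i)} + y_1)$.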

2-residual blocks
In ResQNets with two residual blocks, up to three PQCs can be incorporated within three QNs.There are three potential configurations for the residual blocks in this setup: 1. utilizing only the first QN as a residual block, 2. combining the first two QNs to form a single residual block, 3. utilizing both the first and second QNs individually as separate residual blocks.
For our mathematical formulation, only the third configuration will be considered since it is the general setting for the case of two residual blocks; the other configurations effectively contain a single residual block, which has already been mathematically derived in Sect. 3.2.1. However, we will conduct experiments that examine all three configurations to determine which one performs best. Let $U_1(\theta_1)$, $U_2(\theta_2)$, and $U_3(\theta_3)$ be PQCs enclosed in three QNs, where $\theta_1$, $\theta_2$, and $\theta_3$ are the quantum-encoded classical parameters. The qubits in the first PQC are initialized in the ground state:
$$|\psi^{(1)}_{\text{initial}}\rangle = |0\rangle^{\otimes n}.$$
The initial classical input features are encoded into qubit rotation angles:
$$\theta_1 = f_{\text{encode}}(x^{(i)}).$$
The first PQC $U_1(\theta_1)$ takes the initial quantum state as its input and produces an intermediate quantum state $|\psi_{\text{intermediate},1}\rangle$:
$$|\psi_{\text{intermediate},1}\rangle = U_1(\theta_1)\,|\psi^{(1)}_{\text{initial}}\rangle.$$
The state $|\psi_{\text{intermediate},1}\rangle$ is the output of $U_1(\theta_1)$ before the measurement. Upon measuring the qubits, this quantum state collapses and produces a classical result:
$$y_1 = M(|\psi_{\text{intermediate},1}\rangle).$$
Before passing the output of QN1 ($y_1$) as input to QN2, it is first added element-wise to the original input $x^{(i)}$. Since both $y_1$ and $x^{(i)}$ are classical values, an encoding function is applied again so that the PQC in QN2 can process them:
$$\theta_2 = f_{\text{encode}}(x^{(i)} + y_1).$$
The encoded data is then passed to the second PQC, yielding another intermediate quantum state $|\psi_{\text{intermediate},2}\rangle$:
$$|\psi_{\text{intermediate},2}\rangle = U_2(\theta_2)\,|\psi^{(2)}_{\text{initial}}\rangle,$$
where $|\psi^{(2)}_{\text{initial}}\rangle$ denotes the ground state initialization of the qubits in the second PQC. The quantum state $|\psi_{\text{intermediate},2}\rangle$ is the result before measurement, and upon measuring the PQC in QN2 we get the classical result:
$$y_2 = M(|\psi_{\text{intermediate},2}\rangle).$$
Finally, the third PQC $U_3(\theta_3)$ takes as input the sum of the input of QN2 (the output of QN1 after the first residual addition) and the output of QN2, and produces the final quantum state $|\psi_{\text{final}}\rangle$. Since this sum is again classical, it has to be encoded before being passed to $U_3(\theta_3)$:
$$\theta_3 = f_{\text{encode}}(x^{(i)} + y_1 + y_2).$$
The final quantum state obtained after the action of $U_3(\theta_3)$ will be:
$$|\psi_{\text{final}}\rangle = U_3(\theta_3)\,|\psi^{(3)}_{\text{initial}}\rangle,$$
where $|\psi^{(3)}_{\text{initial}}\rangle$ denotes the ground state initialization of the qubits in the third PQC. The final quantum state collapses into a classical vector after measurement, which is the final result of the network used for further optimization.
The same procedure can be extended for n-residual blocks with different residual configurations.
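To make the three residual configurations for three QNs concrete, the sketch below wires up only the classical data flow between nodes. Each node is a stand-in callable (in a real ResQNet it would be the encode, PQC, and measure pipeline); the configuration names and the toy node functions are hypothetical.

```python
def forward(x, nodes, config):
    """Classical data flow through three quantum nodes for a given configuration."""
    f1, f2, f3 = nodes
    if config == "first_only":            # only QN1 is a residual block
        y1 = f1(x)
        y2 = f2(x + y1)
        return f3(y2)
    if config == "first_two_combined":    # QN1 and QN2 form one residual block
        y2 = f2(f1(x))
        return f3(x + y2)
    if config == "both_individual":       # QN1 and QN2 are separate residual blocks
        y1 = f1(x)
        x2 = x + y1                       # input of QN2
        y2 = f2(x2)
        return f3(x2 + y2)                # input of QN2 added to its output
    raise ValueError(f"unknown configuration: {config}")

# Toy scalar stand-ins for the three nodes, chosen so the outputs differ.
nodes = (lambda v: 2 * v, lambda v: v * v, lambda v: 10 * v)
outputs = {c: forward(1, nodes, c)
           for c in ("first_only", "first_two_combined", "both_individual")}
```

Extending this to $n$ nodes amounts to deciding, for each node boundary, which earlier input is added to the current output before re-encoding.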
Given a set of $n$ PQCs, $U_1(\theta_1), U_2(\theta_2), \ldots, U_n(\theta_n)$, and an initial quantum state $|\psi_{\text{initial}}\rangle$, the objective is to find the set of parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ that maximizes (or minimizes) some cost function $C(\theta)$ associated with the final quantum state $|\psi_{\text{final}}\rangle$ produced by the cascaded PQCs. The optimization problem can be formulated as:
$$\theta^* = \arg\max_{\theta} C(\theta) \quad \text{(or } \arg\min_{\theta} C(\theta)\text{)},$$
where $\theta^*$ represents the optimal set of parameters that maximizes (or minimizes) the cost function. The cost function $C(\theta)$ can be defined based on the desired behavior of the quantum circuit and can be calculated from the measurement result of the final quantum state $|\psi_{\text{final}}\rangle$.

Methodology
In classical NNs, residual neural networks (ResNets) were proposed to overcome the problem of vanishing gradients and proved very useful for enabling deep learning in classical machine learning. In this paper, we propose Residual Quantum Neural Networks (ResQNets) to enable deep learning in QNNs by mitigating the effect of BP as a function of the number of layers.
The conventional approach to constructing QNNs uses an arbitrarily deep PQC, which takes some input and yields some output. Such an architecture typically has a single QN, as depicted in Fig. 3a. In this paper, we refer to this traditional QNN architecture as "Simple PlainQNet".
To construct our proposed ResQNets, we need to further split the traditional QNN architecture into two QNs, where every QN contains arbitrarily deep quantum layers. Since our proposed ResQNets contain at least two QNs and the traditional way of constructing QNNs contains a single QN, we construct a slightly modified version of simple PlainQNet, which we call "PlainQNet" and which includes two or more QNs, with each QN containing PQCs of arbitrary depth, as shown in Fig. 3b. In PlainQNets, the output of the previous QN is fed to the next QN. The purpose of constructing PlainQNet is to have a fair comparison with our proposed ResQNets, because ResQNets need two or more QNs to work. An example of the ResQNet architecture with two QNs is shown in Fig. 3c. The PlainQNet architecture is similar to a general QNN split into two QNs, whereas in the case of ResQNet, the first QN serves as the residual block, i.e., the input of the first QN is added to its output and then passed as input to the second QN.
It should be noted that ResQNets can comprise multiple QNs with various arrangements of residual blocks. For instance, the ResQNet from Fig. 3c can be extended to have three QNs, in which case three potential configurations can be employed: the first and second QNs acting as individual residual blocks, the first and second QNs combined to serve as a single residual block, and only the first QN functioning as the residual block. We also run experiments with three QNs in each of these configurations.

Quantum layers design
For the design of quantum layers, we use a periodic structure containing two single-qubit unitaries (RX and RY) per qubit. The parameters of these unitaries are randomly initialized in the range $[0, \pi]$. Furthermore, a two-qubit gate, i.e., the CNOT gate, is used to entangle qubits, and every qubit is entangled with its neighboring qubit. Figure 4 shows an example of the quantum layer design we used (5 qubits). All the QNs in our experiments have the same quantum layer design.

Depth of quantum layers
The depth of the quantum layers has a significant impact on whether BP appear in the cost function landscape of a QNN. The effective depth (the longest path within the quantum circuit until the measurement) is crucial in this regard. For convenience, we introduce two depth parameters: layer depth ($D_L$) and effective depth ($D_E$). The layer depth $D_L$ refers to the combined number of repetitions of the quantum layer illustrated in Fig. 4 in both QNs, while the effective depth $D_E$ represents the overall depth. For our quantum layers design, the following equation can be used to calculate the effective depth.

Depth distribution per QN
As previously discussed, ResQNets and PlainQNets consist of multiple QNs, which results in different depth splits for a given depth of quantum layers. According to the definition of BP, the gradient vanishes as a function of the number of qubits; hence, we fix the depth of quantum layers to $D_L = 6$ and only vary the number of qubits. Table 1 summarizes the different depth-per-QN combinations for $D_L = 6$, and all these depth combinations are tested for different numbers of qubits. Column 3 of Table 1 represents the depth split in the form of ordered pairs (we use this form in the rest of the paper whenever we discuss the depth split per QN). For instance, (1, 5) denotes $D_L = 1$ in the first QN and $D_L = 5$ in the second QN. The depth-per-QN combination can be extended to more than two QNs in a similar manner.

Cost function definition
For training our proposed ResQNet, we consider a simple example of learning the identity gate. In such a scenario, a natural cost function is one minus the probability of measuring the all-zero state, which can be described by the following equation:
$$C = 1 - p_{|0\rangle}.$$
We consider the global cost function setting, i.e., we measure all the qubits in the network. Therefore, the above cost function definition is applied across all $n$ qubits according to the following equation:
$$C = 1 - p_{|0\rangle^{\otimes n}}.$$
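As a small numerical illustration of the global cost, the sketch below evaluates one minus the all-zero probability on product states. The product-state construction, the rotation angle of 0.4, and the helper names are assumptions chosen only to give a closed form to compare against.

```python
from functools import reduce
import numpy as np

def product_state(angles):
    """Tensor product of RY(angle)|0> over all qubits."""
    qubits = [np.array([np.cos(a / 2), np.sin(a / 2)]) for a in angles]
    return reduce(np.kron, qubits)

def global_cost(psi):
    """One minus the probability of measuring every qubit in state 0."""
    return 1.0 - float(np.abs(psi[0]) ** 2)

perfect = product_state([0.0, 0.0, 0.0])   # identity learned exactly
rotated = product_state([0.4, 0.4, 0.4])   # every qubit slightly off
costs_by_width = [global_cost(product_state([0.4] * n)) for n in (2, 4, 8)]
```

For a fixed per-qubit error the all-zero probability is a product over qubits, so the global cost moves towards 1 as the register widens, which is one reason global cost functions are closely associated with barren plateaus.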
For cost function optimization, we use the Adam optimizer (with a stepsize of 0.1), which is a gradient-based optimization method. The Adam optimizer updates the parameters of a model iteratively based on the gradient of the loss function with respect to the parameters. It uses exponentially decaying averages of the first and second moments of the gradients to adapt the learning rate for each parameter. Let $g_t$ be the gradient of the loss function with respect to the parameters at iteration $t$. The first moment, $m_t$, and the second moment, $v_t$, are computed as follows:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$
where $\beta_1$ and $\beta_2$ are the decay rates for the first and second moments, respectively. The bias-corrected first moment and second moment are then computed as:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$
Finally, the parameters are updated using the following equation:
$$\theta_{t+1} = \theta_t - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$
where $\alpha$ is the learning rate and $\epsilon$ is a small constant to prevent division by zero.
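The update rule can be written out directly. The sketch below applies Adam with a stepsize of 0.1 to a one-parameter toy problem, the identity-learning cost of a single RY rotation, $C(\theta) = 1 - \cos^2(\theta/2)$. The toy cost, the starting point, the number of steps, and the default values $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are assumptions, not settings reported for the experiments.

```python
import math

def adam_minimize(grad, theta, steps=300, alpha=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam: moment estimates, bias correction, then parameter update."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

def cost(theta):
    """1 - p(|0>) after applying RY(theta) to |0>."""
    return 1.0 - math.cos(theta / 2) ** 2

def grad(theta):
    """Analytic derivative of the cost: sin(theta) / 2."""
    return 0.5 * math.sin(theta)

theta_start = 1.0
theta_star = adam_minimize(grad, theta_start)
```

On a barren plateau the gradient `g` is close to zero for most random starting points, so both moment estimates carry little signal and the optimizer has almost no information about which direction to move in.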

Results and discussion
In order to investigate the issue of BP in both PlainQNets and ResQNets, we maintain a constant depth of quantum layers, $D_L = 6$, which comprises 100 quantum gates and 60 parameters. The quantum layer depth distribution is varied among different combinations, as discussed in Table 1. The $D_E$ per QN can then be calculated using Eq. (2). The performance of both networks is evaluated by comparing their cost function landscapes and training results for the problem specified in Eq. (3).

PlainQNet and simple PlainQNet
In this paper, the construction of the proposed ResQNets involves a minimum of two QNs, whereas traditional QNN development uses a single QN (referred to as "simple PlainQNets" in this paper). To ensure a fair performance comparison between QNNs with no residual connections and our proposed ResQNets, we modify the architecture of simple PlainQNets by dividing it into two QNs (referred to as "PlainQNets" in this paper). This architectural modification is primarily intended to give PlainQNets (QNNs with no residual connection) and ResQNets a similar architecture before comparing their performance.
Given this modification to the conventional QNN architecture, it is necessary to compare the performance of the unaltered simple PlainQNets with that of the adapted PlainQNets. This preliminary comparison aims to identify any consequences arising from the structural modification. If the modification causes minimal disruption to performance, it establishes a basis for confidently conducting the subsequent comparison between PlainQNets and ResQNets. The simple PlainQNets and PlainQNets are compared for 6-qubit and 7-qubit quantum layers with a constant depth of $D_L = 6$. In the case of PlainQNets, the depth distribution per QN can vary, but we use the depth combinations of (5, 1) and (4, 2), where the first entry represents the depth of the first QN and the second entry represents the depth of the second QN, as shown in Table 1. We choose deeper quantum layers in the first QN and a relatively shallow depth in the second QN primarily because this configuration leads to better performance, as discussed in more detail in the subsequent sections. For 6-qubit quantum layers, the effective depth ($D_E$) of PlainQNets for both depth combinations mentioned above is 30 (as defined in Eq. (2)). The closest possible $D_E$ for simple PlainQNets using the quantum layers considered in this paper (shown in Fig. 4) is 31, with an overall $D_L$ of 7 (as defined in Eq. (1)), which was used in the comparison. Similarly, for 7-qubit quantum layers, the $D_E$ for PlainQNets is 32 for both depth combinations per QN. The closest $D_E$ in the case of simple PlainQNets is obtained for $D_L = 7$.
Both PlainQNets and simple PlainQNets are then trained for the problem specified in Eq. (3). The training results are displayed in Fig. 5. For 6-qubit layers, PlainQNets and simple PlainQNets exhibit comparable performance. However, when the number of qubits increases to 7, the performance of simple PlainQNets decreases significantly due to BP, while that of PlainQNets improves. Based on these observations, we infer that it is appropriate to compare the performance of PlainQNets with that of our proposed ResQNets. Hence, for the remainder of the paper, we compare the performance of PlainQNets, which are QNNs containing two (or more) QNs, with that of ResQNets.

ResQNet with shallow width quantum layers
In this section, we perform a comparative analysis of the incidence of BP in both PlainQNets and ResQNets. Both PlainQNets and ResQNets consist of two QNs, with a maximum of one residual block in the case of ResQNets. To facilitate a fair comparison, we consider shallow depth quantum layers with $D_L = 6$ and incrementally vary the number of qubits from 6 to 10.

6-qubit circuit
In this setting, we experiment with a total of 6 qubits. The cost function landscapes for both PlainQNet and ResQNet were analyzed and compared, as shown in Fig. 6. The results demonstrate that, for almost all depth combinations, a significant portion of the cost function landscape of the PlainQNet is flat, with only a narrow region containing the global minimum. On the other hand, the cost function landscapes of ResQNets are less flat and have a wider region containing the global minimum, which makes ResQNets more suitable for optimization.
PlainQNets and ResQNets were then trained for the problem defined in Eq. (3). The training results are depicted in Fig. 7. When the depth of the second QN is equal to or greater than the depth of the first QN, the PlainQNets do not train successfully. This can be attributed to the flat cost function landscape, i.e., the BP, as depicted in Fig. 6. For the same depth distribution per QN (depth in second QN ≥ depth in first QN), the ResQNets did train. However, they struggled to reach an optimal solution due to the presence of multiple local minima in their cost function landscape. In instances where the depth of the first QN is greater than that of the second QN, both PlainQNets and ResQNets trained successfully, but ResQNets outperformed PlainQNets.

8-qubit circuit
We now conduct experiments on both PlainQNets and ResQNets with 8-qubit layers, and examine the cost function landscapes of both PlainQNets and our proposed ResQNets. The overall layer depth is set to 6, and all depth combinations are analyzed. The results presented in Fig. 8 reveal that approximately 90% of the cost function landscape for PlainQNets remains flat irrespective of the depth distribution per QN, making them unsuitable for optimization. In contrast, the cost function landscapes of ResQNets are still not flat for any of the depth combinations, and thus are more favorable for optimization.
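One way to put a number on how much of a landscape is flat is to sample the cost on a two-parameter grid and count the grid points where the gradient magnitude falls below a threshold. The sketch below is only an illustration of that idea: the function `flat_fraction`, the threshold `tol`, and the synthetic landscape (a flat surface with a single narrow well) are assumptions introduced here, and the text does not state how the flat portion of the landscapes in the figures was estimated.

```python
import numpy as np

def flat_fraction(landscape, spacing=1.0, tol=1e-3):
    """Fraction of grid points whose finite-difference gradient norm is below tol.

    landscape: 2-D array of cost values sampled on a uniform grid.
    spacing:   grid step, assumed equal along both parameter axes.
    tol:       hypothetical flatness threshold.
    """
    d0, d1 = np.gradient(landscape, spacing)
    grad_norm = np.sqrt(d0 ** 2 + d1 ** 2)
    return float(np.mean(grad_norm < tol))

# Synthetic example (not data from the paper): cost is ~1 everywhere except
# for a narrow Gaussian well around the origin.
thetas = np.linspace(-np.pi, np.pi, 101)
t1, t2 = np.meshgrid(thetas, thetas, indexing="ij")
toy_cost = 1.0 - np.exp(-(t1 ** 2 + t2 ** 2) / (2 * 0.2 ** 2))

fraction = flat_fraction(toy_cost, spacing=thetas[1] - thetas[0])
print(f"flat fraction of the toy landscape: {fraction:.2f}")
```

The result depends on the grid resolution and on `tol`, so such a number is best used to compare landscapes sampled in the same way rather than as an absolute measure.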
We conduct training experiments for both PlainQNets and ResQNets with 8-qubit quantum layers to solve the problem defined in Eq. (3). The training results are presented in Fig. 9, which shows that as we increase the number of qubits from 6 to 8, the PlainQNets get trapped in the flat cost function landscape (i.e., BP) for all the depth combinations per QN and fail to train effectively for the specified problem.
On the other hand, the ResQNets demonstrate successful training across all the depth combinations, surpassing the performance of PlainQNets. Notice that ResQNets exhibit superior learning outcomes when the depth of the first QN is much greater than that of the second QN ($D_E$ in QN1 $\gg$ $D_E$ in QN2), such as in the case of (5, 1). This is because in such scenarios the cost function landscape has fewer and wider regions leading to the global minimum. Conversely, when the depth of the second QN is equal to or greater than that of the first QN, the cost function landscape is characterized by multiple local minima, making it less suitable for optimization as the optimizer becomes trapped in local minima. This phenomenon can be attributed to the presence of residual blocks in ResQNets. In the

10-qubit circuit
To expand our study further, we increased the number of qubits to 10 and performed the same experiments as with quantum layers of 6 and 8 qubits. The cost function landscapes were then analyzed for both PlainQNets and ResQNets, as shown in Fig. 10. Similar to the case of 8-qubit layers, a substantial portion of the cost function landscape of PlainQNets was found to be flat, indicating the presence of BP and making it unsuitable for optimization.

Subsequently, we trained the 10-qubit quantum layers to address the problem defined in Eq. (3). The results of these experiments are depicted in Fig. 11. Our analysis indicates that PlainQNets did not train successfully for nearly all depth combinations, with the exception of (4, 2), which showed considerable performance improvement. When we examined its cost function landscape in Fig. 10, we observed one or two narrow regions that contain the solution and that the optimizer may find and follow to convergence. However, these narrow regions are unlikely to be encountered, and thus the performance, despite being optimal, is not considered suitable for general optimization problems. Therefore, it can still be concluded that the PlainQNets are severely affected by the problem of BP. On the other hand, ResQNets effectively overcame the issue of BP and trained successfully for all depth combinations. Our observations for 10-qubit quantum layers align with our previous findings for 6- and 8-qubit layers in that ResQNets are more effective when the depth after the residual connection is smaller. This suggests that a shallower depth of quantum layers after the residual connection in ResQNets is more favorable for optimization and for mitigating the impact of BP.
Our results demonstrate that PlainQNets are heavily impacted by the issue of BP as the number of qubits increases, which significantly hinders their performance and ability to optimize the cost function. The results above also demonstrate the advantage of our proposed ResQNets over PlainQNets in mitigating BP. Therefore, in the next section, we conduct experiments solely with ResQNets.

ResQNets with wider quantum layers
To analyze the scalability of ResQNets for larger quantum circuits, we consider quantum layers with a larger number of qubits, i.e., 15 and 20. The depth of the quantum layers, $D_L$, is kept constant at 6. As shown in Sect. 5.2, the cost function landscapes have a direct impact on the training results. Consequently, we only present the training results for the 15- and 20-qubit quantum layers.

The training results for the 15-qubit layers are shown in Fig. 12a. It can be observed that the ResQNets are effectively trained. Additionally, analogous to the case of shallow-width quantum layers, the performance is substantially better when the depth in the first QN (before the residual point) is larger than that in the second QN.

20-qubit circuit
We now train the ResQNets for 20-qubit layers for the problem defined in Eq. (3), with a total layer depth of $D_L = 6$. Even with 20-qubit layers, the ResQNets are effectively trained, as shown in Fig. 12b. Furthermore, similar to the previously shown results, the ResQNets for 20-qubit layers also perform significantly better when the depth after the residual point (second QN) is smaller than the depth before the residual point (first QN).
From the results in Fig. 12, it is evident that the ResQNets are capable of working with wider quantum layers. As in the case of shallow-width quantum layers, the best training performance is achieved with a larger depth in the first QN and a smaller depth in the second QN.
It should be noted that our experiments are limited by the memory constraints of our local computer, so we cannot go beyond 20 qubits. However, based on our findings, we believe that the proposed ResQNets would still train effectively beyond 20 qubits.

ResQNets with 3-QN
From the analysis presented in previous sections, it can be observed that the ResQNets consisting of two QNs with a maximum of one residual block can effectively address the problem of BP and significantly improve the training performance of QNNs. In this section, we show that increasing the number of QNs in ResQNets can enhance the performance of ResQNets even further. As discussed in Sect. 4, for three QNs we can have multiple configurations of residual blocks. We consider all of these configurations for our experiments with 20-qubit quantum layers and a fixed quantum layer depth of $D_L = 6$. The results of the experiments conducted in this section provide insights into the optimal configuration of residual blocks for ResQNets with three or more QNs. The cost function landscapes of various residual block configurations in ResQNets with three QNs were analyzed, as presented in Fig. 13. The results indicate that the placement of residual blocks has a significant impact on the performance of ResQNets.
When the residual block is added after every QN, the cost function landscape quickly flattens irrespective of the depth per QN, suggesting that this configuration leads to equivalent or suboptimal performance compared to PlainQNets, which is not at all suitable for optimization.
On the other hand, when the residual block is added after two QNs, the cost function landscape shows multiple and wider regions containing the global minimum, which makes this configuration more suitable for optimization. Moreover, this configuration exhibits a consistent cost function landscape regardless of the depth per QN combination. For the case of adding the residual only after the first QN, with two QNs after the residual block, the results show that the cost function landscape is better than in the case of adding the residual block after every QN, but not as good as in the case where the residual is added after a gap of two QNs.
We then trained ResQNets with three QNs for all the configurations while varying the depth for each QN combination on the problem defined in Eq. (3). The training results are shown in Fig. 14. These results align with the behavior of the cost function landscape, where the residual block configuration skipping two QNs outperforms other configurations. It can be observed that the residual block configuration after every QN does not train at all, while the residual block configuration after the first QN does converge for all the depth per QN combinations, but with significantly slower convergence compared to the residual block configuration after two QNs.
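To make the three residual placements concrete, the following sketch wires three nodes together in each configuration and contrasts them with a network that has no residual connection. It is an illustration under stated assumptions, not the circuits used in the experiments: `toy_qn` is a hypothetical stand-in for a quantum node (per feature, an RY encoding followed by one trainable RY rotation and a Pauli-Z expectation value, which evaluates to a cosine), and the final node's output is assumed to be returned without a further skip connection.

```python
import numpy as np

def toy_qn(x, theta):
    """Hypothetical stand-in for a quantum node (not the paper's circuit).

    Per feature: RY(x_i) encoding, trainable RY(theta_i), then <Z>,
    which gives cos(x_i + theta_i) for a qubit starting in |0>.
    """
    return np.cos(x + theta)

def plain_3qn(x, th):
    # No residual connection: each QN feeds the next one directly.
    return toy_qn(toy_qn(toy_qn(x, th[0]), th[1]), th[2])

def residual_after_every_qn(x, th):
    # Skip connection around QN1 and around QN2.
    h = x + toy_qn(x, th[0])
    h = h + toy_qn(h, th[1])
    return toy_qn(h, th[2])

def residual_after_two_qns(x, th):
    # One residual block spanning QN1 and QN2, written "(a b, c)" in the figures.
    h = x + toy_qn(toy_qn(x, th[0]), th[1])
    return toy_qn(h, th[2])

def residual_after_first_qn(x, th):
    # Residual block around QN1 only, written "(a, b c)" in the figures.
    h = x + toy_qn(x, th[0])
    return toy_qn(toy_qn(h, th[1]), th[2])

# Toy usage with random inputs and parameters (illustrative values only).
rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=4)
th = rng.uniform(-np.pi, np.pi, size=(3, 4))
for net in (plain_3qn, residual_after_every_qn,
            residual_after_two_qns, residual_after_first_qn):
    print(net.__name__, net(x, th))
```

The only difference between the variants is where the input of a block is added back to its output, which is the design choice the landscapes and training curves in this section compare.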

3-QN vs. 2-QN ResQNet
In this section, we compare the performance of ResQNets with 2 and 3 QNs to demonstrate the impact of increasing the number of QNs. The analysis was conducted for 20-qubit layers considering the best-performing depth combinations for both 2 and 3 QNs.
For 2 QNs, the results from Fig. 12b indicate that the depth combinations of (5, 1) and (4, 2) performed better than other depth combinations. On the other hand, for three QNs, the results from Figs. 14b and 14c show that the depth combinations of (4 1, 1) and (4, 1 1) outperformed other depth combinations. A closer examination of the best-performing depth combinations reveals that the $D_L$ before and after the residual block for the depth per QN combination of (5, 1) in the 2-QN ResQNet is equivalent to that of the combination (4 1, 1) in the 3-QN ResQNet. Similarly, the combination (4, 2) in the 2-QN ResQNet is equivalent to (4, 1 1) in the 3-QN ResQNet. Despite these similarities, as demonstrated in Fig. 15, the ResQNets with 3 QNs exhibit superior performance, as they converge to the optimal solution more efficiently than the ResQNets with 2 QNs.
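The equivalence between the 2-QN and 3-QN depth combinations is simple arithmetic on the notation used in the figure captions, where entries are the $D_L$ per QN and a comma marks the residual point. The helper below, `depths_per_block`, is a hypothetical name introduced only for this illustration; it sums the layer depths on each side of the residual point.

```python
def depths_per_block(combo):
    """Sum the per-QN layer depths on each side of every residual point.

    combo uses the caption notation, e.g. "(4 1, 1)": space-separated depths
    per QN, with a comma marking where the residual connection is added.
    """
    inner = combo.strip().strip("()")
    return [sum(int(d) for d in block.split()) for block in inner.split(",")]

# Compare the best-performing 2-QN and 3-QN combinations named in the text.
for combo in ["(5, 1)", "(4 1, 1)", "(4, 2)", "(4, 1 1)"]:
    print(combo, "->", depths_per_block(combo))
```

Both (5, 1) and (4 1, 1) place a total depth of 5 before the residual point and 1 after it, while (4, 2) and (4, 1 1) place 4 before and 2 after, so the compared networks differ only in how that depth is distributed across QNs.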

Real quantum device
The results presented so far were obtained by running ResQNets and PlainQNets on a simulation platform. In this section, we carry out some experiments on real quantum devices. In particular, we trained both ResQNets and PlainQNets with 2 QNs on a 5-qubit quantum layer using an IBM quantum device, namely ibmq_lima. The quantum layer depth was fixed to $D_L = 6$, with $D_L = 5$ in the first QN and $D_L = 1$ in the second QN. This depth combination was chosen considering all the results discussed previously. We note that due to the limited number of publicly available quantum devices, the queue times for executing the jobs are considerably long. Therefore, to minimize the training time, we trained both PlainQNets and ResQNets for only 20 epochs on the real device instead of 100 epochs as in the case of simulation. The training results are illustrated in Fig. 16.
The results presented in Fig. 16a reveal that ResQNets train successfully on a real device, whereas PlainQNets do not. The same trend is observed when both networks are executed on the simulator, as depicted in Fig. 16b. However, when both PlainQNets and ResQNets are trained on a real device, a slight fluctuation is observed while approaching the optimal solution due to hardware noise, as compared to the simulation results. Despite the presence of noise, the rate of decrease in the loss value for ResQNets is almost identical for both simulation and real experiments. According to [52], hardware noise can potentially cause BP. However, our results demonstrate that our proposed ResQNets are somewhat resilient against hardware noise, as they achieve similar performance to that of the simulator (though with some fluctuations).

Conclusion
The problem of barren plateaus (BP) in quantum neural networks (QNNs) is a critical hurdle on the road to the practical realization of QNNs. There have been several attempts to resolve this issue, but the impact of BP can still vary greatly depending on the application and the architecture of the quantum layers. Thus, it is essential to have multiple solutions for BP to cover a wide range of problems.
In this paper, we propose residual quantum neural networks (ResQNets) to address the issue of BP in QNNs. Our approach is inspired by classical residual neural networks (ResNets), which were introduced to overcome the problem of vanishing gradients in classical neural networks.
In traditional QNNs, a single parameterized quantum circuit (PQC) with arbitrary depth is included within a single quantum node (QN). To create ResQNets, we split the conventional QNN architecture into multiple QNs, each of which contains its own PQC with varying depths. Splitting the QNNs allows us to introduce the residual connections between the QNs, forming our proposed ResQNets. In simple QNNs without residual connections (referred to as PlainQNets), the output from the previous QN serves as the input to the next. On the other hand, in ResQNets, one or multiple QNs can serve as residual blocks, with the output from a previous residual block being added to its input before it is passed on to the next QN.
In our study, we first demonstrate the efficacy of the proposed splitting of the conventional QNN architecture into multiple QNs (PlainQNets) by comparing their performance to that of conventional QNNs (simple PlainQNets). The comparison indicates that PlainQNets perform as well as or better than conventional QNNs. Subsequently, we compare the performance of PlainQNets with that of our proposed ResQNets through several training experiments. Our analysis of the cost function landscapes for quantum layers with an increasing number of qubits shows that incorporating residual connections results in improved training performance.
Based on our findings, we conclude that the proposed ResQNets provide a promising solution to overcome the problem of BP in QNNs and offer a potential direction for further research in the field of quantum machine learning.

Figure 1 Residual block structure

Figure 3 QNN architecture used in this paper: (a) simple PlainQNet, (b) PlainQNet, and (c) ResQNet. The internal architecture and working of a QN are shown in Fig. 2

Figure 4 Quantum Layers Design

Figure 5 Cost vs. iterations of PlainQNets and simple PlainQNets (a) for 6 qubits and (b) for 7 qubits. The parentheses denote the $D_L$ per QN

Figure 6 Cost function landscapes of PlainQNet (upper panel) and ResQNet (lower panel) for 6 qubits. The parentheses denote the $D_L$ per QN

Figure 8 Cost function landscapes of PlainQNet (upper panel) and ResQNet (lower panel) for 8 qubits. The parentheses denote the $D_L$ per QN

Figure 11 Cost vs. iterations of (a) PlainQNets and (b) ResQNets for 10 qubits. The parentheses denote the $D_L$ per QN

Figure 12 Cost vs. iterations of ResQNets for (a) 15 qubits and (b) 20 qubits. The parentheses denote the $D_L$ per QN

Figure 13 Cost function landscapes of ResQNets for 20 qubits and 3 QNs. Residual after every QN (top panel), residual after two QNs (middle panel), and residual only after the first QN (bottom panel). The parentheses denote the $D_L$ per QN and the comma denotes the residual point

Figure 14 Training results of ResQNets with three QNs with 20-qubit layers. (a) Residual after every QN, (b) residual after two QNs, and (c) residual after the first QN. The parentheses denote the $D_L$ per QN and the comma denotes the residual point

Figure 15 Training comparison of 2-QN and 3-QN ResQNets for 20-qubit layers. The parentheses denote the $D_L$ per QN and the comma denotes the residual point

Table 1 Depth combinations per QN

formance, it is important to calculate $D_E$ of each QN individually and then add them to obtain the final $D_E$. Failure to calculate the depth in each QN separately could result in an effective depth different from the sum of the effective depths of each QN, i.e., $D_L$/QN1 + $D_L$/QN2 = $D_E$. For example, with $D_L = 2$, the total effective depth would be 10 without considering the splitting into two QNs. However, if $D_L$ is split into two QNs with $D_L$/QN = 1, the effective depth would be 12. A modified version of Eq. (1) should be used to calculate the $D_E$ per QN, as described below.