Deep reinforcement learning for universal quantum state preparation via dynamic pulse control

Accurate and efficient preparation of quantum states is a core issue in building a quantum computer. In this paper, we investigate how to prepare a certain single- or two-qubit target state from arbitrary initial states in semiconductor double quantum dots with the aid of deep reinforcement learning. Our method is based on training the network over numerous preparation tasks. The results show that once the network is well trained, it works for any initial state in the continuous Hilbert space, so repeated training for new preparation tasks is avoided. Our scheme outperforms traditional gradient-based optimization approaches in both design efficiency and preparation quality in the discrete control space. Moreover, we find that the control trajectories designed by our scheme are robust against static and dynamic fluctuations, such as charge and nuclear noise.


Introduction
Future quantum computers promise exponential speed-ups over their classical counterparts in solving certain problems such as search and simulation [1]. A wide variety of promising modalities has emerged in the race to realize the quantum computer, such as trapped ions [2,3], photonic systems [4][5][6][7], nitrogen-vacancy centers [8], nuclear magnetic resonance [9], superconducting circuits [10,11] and semiconductor quantum dots [12][13][14][15][16][17][18]. Among these, semiconductor quantum dots are a powerful competitor owing to their potential scalability, integrability with existing classical electronics and well-established fabrication technology. The spins of electrons confined in quantum-dot structures by the Coulomb interaction can serve as spin qubits for quantum information [19]. Spin qubits can be encoded in many ways, such as spin-1/2, singlet-triplet (S-T 0 ) and hybrid systems [20]. In particular, the spin S-T 0 qubit in double quantum dots (DQDs) attracts much attention because it can be manipulated solely with electrical pulses [21][22][23].
It has been proved that arbitrary single-qubit gates plus an entangling two-qubit gate form a universal set from which all other logic gates of a circuit-model quantum computer can be constructed [1,24]. In a real sense, the implementation of any single- and two-qubit gate can be reduced to a state preparation problem. Arbitrary manipulations of a single qubit can be achieved by successive rotations around the x- and z-axes of the Bloch sphere. In the context of the S-T 0 single qubit in semiconductor DQDs, the only tunable parameter J sets the rotation rate around the z-axis, while the rotation rate h around the x-axis is difficult to change [25].
Various schemes have been proposed to apply proper pulses on J to control the qubits [26][27][28]. Analytically tailoring the control trajectory typically requires iteratively solving a set of nonlinear equations [29,30], which is computationally exorbitant and time-consuming in practice. There are also several traditional gradient-based optimization methods that can be used to design the control trajectory, such as stochastic gradient descent (SGD) [31], chopped random-basis optimization (CRAB) [32,33] and gradient ascent pulse engineering (GRAPE) [34,35]. However, the intensities of their pulses are nearly continuous, which poses challenges to experimental implementation, and the requirement of discrete pulses inevitably reduces their performance [36]. In addition, their efficiency is limited by their iterative nature, which makes pulse design a heavy burden, especially when a large number of states are waiting to be processed.
Recently, the generation of arbitrary states from a specific state in the nitrogen-vacancy center has been realized with the aid of deep reinforcement learning (RL) [53]. It is then intriguing to check whether deep RL can be used to solve the converse problem: preparing a certain target state from arbitrary initial states, i.e., universal state preparation (USP). In practical quantum computation, it is often required to reset an arbitrary state to a specific target state [54][55][56].
For example, the initial state of the system always needs to be set to the ground state when transferring a quantum state through a spin chain [54,55]. In the realization of the quantum Toffoli or Fredkin gate, the ancilla state must be pre-prepared in the standard state |0⟩ in certain cases [57][58][59]. Generation of two-qubit entangled states is also required [1] in quantum information processing tasks, such as teleportation [60,61]. Note that the network typically needs to be retrained once the preparation task changes [36,46]. Thus, designing control pulses could be exhausting work when many different states must be prepared into a certain target state. In this paper, we investigate this USP problem with deep RL in such a constrained-driving-parameter system. Benefiting from more sufficient learning over numerous preparation tasks, we find that USP can be achieved with a single training of the network. Evaluation results show that our scheme outperforms the alternative optimization approaches both in the efficiency of pulse design and in the preparation quality in the discrete control space. In addition, we find that the average number of steps of the control trajectories designed by our USP algorithm is clearly smaller than that of the alternatives. Moreover, we discuss the robustness of the control trajectories designed by our USP algorithm against various noises and explore the major source of errors in control accuracy. We point out that by combining our scheme with Ref. [53], driving between arbitrary states can be realized.

Models and methods
First, we present the models of electrically controlled S-T 0 single- and two-qubit systems in semiconductor DQDs in Sects. 2.1 and 2.2, respectively. Then we present our USP algorithm in Sect. 2.3.

Voltage-controlled single-qubit in semiconductor DQDs
The effective control Hamiltonian of a single qubit encoded by S-T 0 states in semiconductor DQDs can be written as [62][63][64][65]

H(t) = J(t) σ_z + h σ_x,   (1)

written in the computational basis states: the spin singlet state |0⟩ = |S⟩ = (|↑↓⟩ - |↓↑⟩)/√2 and the spin triplet state |1⟩ = |T_0⟩ = (|↑↓⟩ + |↓↑⟩)/√2. Here the arrows indicate the spin projections of the electrons in the left and right dots, respectively. σ_z and σ_x are the Pauli matrices, and h accounts for the Zeeman energy spacing of the two spins. Since h is difficult to change experimentally [20], here we assume it is a constant, h = 1, and set it as the unit of pulse intensity. We also take the reduced Planck constant ℏ = 1 and use 1/h as the time scale throughout. Physically, the exchange coupling J(t) is tunable and non-negative [20]. In addition, if J(t) is limited to a finite range, so as not to destroy the charge configuration of the DQDs, the leakage of population to the non-computational space is suppressed and we can safely study the evolution of the system within the Hilbert space spanned by the two basis states [29,30,38].
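To make the model concrete, the following minimal sketch builds the Hamiltonian of Eq. (1) with QuTiP (the library used in our numerical implementation) and propagates a state through a single rectangular pulse segment; the values of J and dt below are placeholder choices for illustration only.

```python
import numpy as np
from qutip import sigmax, sigmaz, basis

h = 1.0  # Zeeman energy spacing, used as the unit of pulse intensity

def hamiltonian(J):
    """Effective S-T0 single-qubit Hamiltonian H = J*sigma_z + h*sigma_x (hbar = 1), Eq. (1)."""
    return J * sigmaz() + h * sigmax()

def evolve(state, J, dt):
    """Propagate a state through one piecewise-constant pulse segment of duration dt."""
    U = (-1j * hamiltonian(J) * dt).expm()
    return U * state

# Example: apply a single segment with J = 2 for dt = pi/10 to |0> = |S>
psi = basis(2, 0)
psi = evolve(psi, J=2.0, dt=np.pi / 10)
print(psi)
```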
Arbitrary single-qubit states can be written as

|ψ⟩ = cos(θ/2)|0⟩ + e^{iϕ} sin(θ/2)|1⟩,   (2)

where θ and ϕ are real numbers that define points on the Bloch sphere. For an initial state |s_ini⟩ on the Bloch sphere, any target state |s_tar⟩ can be reached by successive rotations around the x- and z-axes of the Bloch sphere. In the context of semiconductor DQDs, h and J(t) drive rotations around the x-axis and z-axis of the Bloch sphere, respectively.
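A point (θ, ϕ) on the Bloch sphere can be mapped to the state of Eq. (2) and compared with a target state via the fidelity F = |⟨s_tar|s⟩|² used later in the text; the short QuTiP sketch below uses example angles purely for illustration.

```python
import numpy as np
from qutip import basis

def bloch_state(theta, phi):
    """|psi> = cos(theta/2)|0> + exp(i*phi) sin(theta/2)|1>, Eq. (2)."""
    return (np.cos(theta / 2) * basis(2, 0)
            + np.exp(1j * phi) * np.sin(theta / 2) * basis(2, 1))

target = basis(2, 0)                          # target state |0> = |S>
psi = bloch_state(2 * np.pi / 7, 3 * np.pi / 7)
F = abs(target.overlap(psi)) ** 2             # F = |<s_tar|psi>|^2
print(F)
```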

Capacitively coupled S-T 0 qubits in semiconductor DQDs
Operations on two entangled qubits are often required in quantum information processing. In semiconductor DQDs, interqubit operations can be performed on two adjacent, capacitively coupled S-T 0 qubits. In the basis {|SS⟩, |ST_0⟩, |T_0S⟩, |T_0T_0⟩}, the Hamiltonian can be written as [21,23,28,63,66,67]

H = Σ_{i=1,2} [J_i σ_z^(i) + h_i σ_x^(i)] + (J_12/2) σ_z^(1) σ_z^(2),   (3)

where h_i and J_i are the Zeeman energy spacing and exchange coupling of the i-th qubit, respectively, and J_12 ∝ J_1 J_2 is the strength of the Coulomb coupling between the two qubits. J_i > 0 is required to maintain the interqubit coupling at all times. For simplicity, we take h_1 = h_2 = 1 and J_12 = J_1 J_2 /2 here.
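The sketch below assembles such a two-qubit Hamiltonian with QuTiP tensor products, assuming the operator form given in Eq. (3) (single-qubit terms J_i σ_z + h_i σ_x on each qubit plus a σ_z ⊗ σ_z coupling); the factor conventions here follow that assumption and should be checked against the cited references before use.

```python
from qutip import sigmax, sigmaz, qeye, tensor

h1 = h2 = 1.0   # Zeeman splittings of the two qubits

def two_qubit_hamiltonian(J1, J2):
    """Capacitively coupled S-T0 qubits in the {|SS>, |ST0>, |T0S>, |T0T0>} basis.

    Single-qubit terms J_i*sigma_z + h_i*sigma_x on each qubit plus a
    sigma_z (x) sigma_z coupling; both the relation J12 = J1*J2/2 and the
    extra 1/2 in the coupling term are the conventions assumed in the text.
    """
    J12 = J1 * J2 / 2.0
    I = qeye(2)
    return (J1 * tensor(sigmaz(), I) + h1 * tensor(sigmax(), I)
            + J2 * tensor(I, sigmaz()) + h2 * tensor(I, sigmax())
            + J12 / 2.0 * tensor(sigmaz(), sigmaz()))

print(two_qubit_hamiltonian(1.0, 2.0))
```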

Universal state preparation via deep reinforcement learning
Our target is to drive arbitrary initial states to a certain target state with discrete pulses.
The control trajectory is discretized as a piecewise-constant function, i.e., the pulses have rectangular shapes [34]. The conclusions still hold if one takes into account the finite rise time of the pulses available from an arbitrary waveform generator [23,28,63,66,68] in actual experiments: we need only alter the parameters of the pulses generated by our algorithm slightly, as demonstrated in [26] and [29]. It is therefore a reasonable simplification to perform the optimization with ideal, zero-rise-time pulses.
The strategy used here is to generate this control trajectory with the deep Q network algorithm (DQN) [69,70], which is an important member of the deep RL family. The details of the DQN are described in the Appendix. Here we simply refer to it as a neural network, the Main Net θ.
Our scheme for obtaining a competent Main Net goes as follows. First, a database comprising numerous potential initial quantum states is divided randomly into a training set, a validation set and a test set. The states in the training set are used to train the Main Net in turn, the validation set is used to estimate the generalization error of the Main Net during training, and the test set is employed to evaluate the Main Net's final performance after training. Second, the randomly initialized Main Net is fed a sampled initial state |s⟩ from the training set at step k = 1 and outputs the predicted "best action" a_k (i.e., the pulse intensity J(t)). According to the current state |s⟩ and the action a_k, the next state |s'⟩ = exp(-iH(a_k) dt)|s⟩ and the corresponding fidelity F = |⟨s_tar|s'⟩|² are calculated. The fidelity F indicates how close the next state is to the target state. Then the next state |s'⟩ is fed to the Main Net as the new current state with the step k ← k + 1. The reward envelopes the fidelity, r = r(F), and is used to train the Main Net. These operations are repeated until the episode terminates, when k reaches the maximum step or the fidelity exceeds a certain satisfactory threshold. Correspondingly, the control trajectory is constructed from these predicted actions in order. After sufficiently many training episodes, the Main Net converges and can be used to design control trajectories for new initial states. An overview of this training and pulse-designing process is pictured in Fig. 1, and a full description of the training process is given in Algorithm 1.
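The following sketch shows the environment side of one such episode for the single-qubit settings used later (allowed actions 0, 1, 2, 3 and dt = π/10); the Main Net is abstracted here as a choose_action callback, and the fidelity threshold 0.999 is an assumed value rather than one quoted in the text.

```python
import numpy as np
from qutip import basis, sigmax, sigmaz

ACTIONS = [0.0, 1.0, 2.0, 3.0]   # allowed pulse intensities J
DT = np.pi / 10                  # pulse segment duration
MAX_STEPS = 20                   # maximum number of segments per episode
F_THRESHOLD = 0.999              # assumed satisfactory-fidelity threshold

target = basis(2, 0)             # target state |0> = |S>

def step(state, action_index):
    """Apply one rectangular pulse segment and return (next_state, fidelity, reward)."""
    J = ACTIONS[action_index]
    H = J * sigmaz() + sigmax()                   # Eq. (1) with h = 1
    next_state = (-1j * H * DT).expm() * state
    F = abs(target.overlap(next_state)) ** 2      # F = |<s_tar|s'>|^2
    return next_state, F, F                       # reward r = F, as in the Results section

def run_episode(initial_state, choose_action):
    """Roll out one preparation episode; choose_action stands in for the Main Net."""
    state, trajectory, F = initial_state, [], 0.0
    for _ in range(MAX_STEPS):
        a = choose_action(state)
        state, F, _ = step(state, a)
        trajectory.append(ACTIONS[a])
        if F > F_THRESHOLD:
            break
    return trajectory, F

# Example with a random policy standing in for the trained network
traj, F = run_episode(basis(2, 1), lambda s: np.random.randint(len(ACTIONS)))
print(len(traj), F)
```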

Results
In this section, we compare the performance of our scheme with two sophisticated gradient-based optimization approaches on the USP problem. As a demonstration, we consider the preparation of the single-qubit state |0⟩ and a two-qubit Bell state. We stress that our USP scheme is applicable to any other target state as long as it is trained for that state specifically.

Universal single-qubit state preparation
Now we consider the preparation of the single-qubit state |0⟩ using our USP algorithm. Considering the challenges of implementing pulses with continuous intensity, our scheme takes only several discrete allowed actions for J(t): 0, 1, 2 or 3, with duration dt = π/10. We stress that these settings are chosen with experimental feasibility in mind and can be further tailored as required. The maximum total operation time is limited to 2π, which is uniformly discretized into 20 slices. The Main Net consists of two hidden layers with 32 neurons each. The reward function should grow as the fidelity increases, so that the Main Net is encouraged to pursue a higher fidelity; in practice, we find that the function r = F works well. To train the Main Net and evaluate the performance of our algorithm, we sample 128 points on the Bloch sphere uniformly with respect to θ and ϕ as the initial states. Both the training and validation sets contain 32 points, while the test set consists of the remaining 64 points. All hyperparameters for this algorithm are listed in Table 1.
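A possible way to generate and split the 128 initial points is sketched below; the 8 × 16 grid in θ and ϕ and the shuffling seed are assumptions for illustration, since only the total number of points and the split sizes are specified above.

```python
import numpy as np

rng = np.random.default_rng(0)

# 128 Bloch-sphere points on a uniform 8 x 16 grid in theta and phi (grid shape assumed)
thetas = np.linspace(0, np.pi, 8)
phis = np.linspace(0, 2 * np.pi, 16, endpoint=False)
points = np.array([(t, p) for t in thetas for p in phis])   # shape (128, 2)

rng.shuffle(points)                                          # shuffle rows in place
train, val, test = points[:32], points[32:64], points[64:]
print(len(train), len(val), len(test))                       # 32 32 64
```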
As shown in Fig. 2(a), after about 33 episodes of training, the average fidelity and total reward over the validation set show no obvious increase as the episode number grows, which implies that the Main Net has converged and can be used to perform the USP task.
To evaluate the performance of our algorithm, we compare it with two alternatives: GRAPE and CRAB. Considering that the efficiency of an algorithm is also an important metric when facing a large number of different preparation tasks, we plot their preparation fidelities for the state |0⟩ versus the corresponding runtime of designing the control trajectories in Fig. 2(b). The average fidelities are F = 0.9968, 0.9721 and 0.9655 and the average pulse-designing times are t = 0.0120, 0.0268 and 0.7504 for USP, GRAPE and CRAB, respectively. The fidelities of the three algorithms are the maxima that can be achieved within the maximum number of steps. To satisfy the limitation of discrete pulses, the continuous control strengths of GRAPE and CRAB are discretized into the nearest allowed actions once the designing process is completed [36]. Although the two traditional algorithms can achieve high average fidelities after convergence with continuous control pulses, F = 0.9997 for GRAPE and F = 0.9995 for CRAB, they do not perform well in the discrete control space. Figure 2(b) shows that our USP algorithm outperforms the alternative optimization approaches both in preparation quality and in pulse-designing efficiency in the discrete control space. Clearly, the CRAB algorithm performs the worst and the GRAPE algorithm lies in the middle. The average numbers of steps needed to achieve the maximum fidelities are 12.297, 14.109 and 13.375 for USP, GRAPE and CRAB, respectively. A trajectory with fewer steps for a given state-preparation task corresponds to a faster control scheme in experiment. Overall, the control trajectories generated by our USP algorithm are better than those of the alternatives.
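For reference, the rounding-to-nearest-allowed-action post-processing applied to the continuous GRAPE/CRAB amplitudes can be sketched as follows; this is only a minimal version of that step, and the alternative algorithms themselves are not shown.

```python
import numpy as np

ALLOWED = np.array([0.0, 1.0, 2.0, 3.0])   # discrete allowed pulse intensities

def discretize(pulse_amplitudes):
    """Map each continuous amplitude (e.g. from GRAPE/CRAB) to the nearest allowed value."""
    pulses = np.asarray(pulse_amplitudes)
    idx = np.abs(pulses[:, None] - ALLOWED[None, :]).argmin(axis=1)
    return ALLOWED[idx]

print(discretize([0.4, 1.7, 2.9]))   # [0. 2. 3.]
```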
To visualize a control trajectory designed by our USP algorithm, we plot one as an example in Fig. 3(a), where the initial state lies at θ = 2π/7, ϕ = 3π/7 on the Bloch sphere and the target state is |0⟩. The USP algorithm takes only 6 steps to complete this task. The reason is that the DQN algorithm favors policies with fewer steps due to the discounted reward (see the details of the DQN in the Appendix).
In Fig. 3(b), we plot the corresponding trajectory of the quantum state on the Bloch sphere during the operations. The final quantum state reaches a position very close to the target state |0⟩ on the Bloch sphere, with a final fidelity F = 0.9999.

Universal state preparation of two coupled S-T 0 qubits
The initial states in the database are sampled from a parameterized family of two-qubit states with θ_i ∈ {π/8, π/4, 3π/8}, and the normalization condition is satisfied for each quantum state represented by these points. The database is divided randomly into a training set, a validation set and a test set with 256, 256 and 6400 points, respectively. As depicted in Fig. 4(a), the Main Net converges after about 700 episodes of training. After 731 episodes of training, the average fidelity of the Bell-state preparation over all the test points is F = 0.9695. The maximum total operation time is taken as 20π and is discretized into 40 slices with pulse duration dt = π/2. In Fig. 4(b), we plot the distribution of the fidelities of the test points under the control trajectories designed by our USP scheme for this two-qubit preparation task. The average pulse-designing time is t = 0.0477 and the average number of steps to complete the preparation tasks is 24.014. Although the fidelities are distributed unevenly over the interval [0.91, 1], the overall performance is good.

USP in noisy environments
In the preceding section, we studied the USP problem without considering the surrounding environment. However, the qubits suffer from a variety of fluctuations in a practical experiment, which prevents high-precision control over the system. There exist works that study corrected gate operations employing additional pulses to counteract the impact of various noises, such as SUPCODE [26,29]. However, they treat different noises equally, which results in designed control trajectories that are too long to implement in actual experiments (about 300π of rotation for a single quantum gate). It is therefore worth exploring which noise poses the most serious threat to the control accuracy, so that compensating pulses can be designed to shorten the total control trajectory within SUPCODE and the physical platform can be improved accordingly. Next we study the performance of the control trajectories designed by our USP algorithm under the two main noises leading to stochastic errors in the system Hamiltonian: charge noise and nuclear noise. Considering that they vary on a typical time scale (∼100 μs) much longer than a gate duration (∼10 ns), we treat them as constants during the preparation task. We point out that these noises are incorporated into the system's evolution after the control trajectories have been designed by our Main Net, which is trained on a noise-free model. This is a reasonable assumption since the environment is normally unpredictable.
The charge noise stems from imperfections of the external voltage field, while the nuclear noise comes from the uncontrolled hyperfine coupling with the spinful nuclei of the host material [63,71,72]. They can be represented by an additional term δσ_z (or δσ_x) in the Hamiltonian (1) for the single-qubit case, or by additional terms δ_i σ_z (or δ_i σ_x) in the Hamiltonian (3) for the two-qubit case, where i ∈ {1, 2} indicates the corresponding qubit and δ (δ_i) is the amplitude of the noise. In addition, for the two-qubit Bell-state preparation, we assume that the amplitudes of the noises on the two qubits are identical.
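A sketch of how a pre-designed single-qubit trajectory can be replayed with such a quasi-static noise term added to Hamiltonian (1) is given below; the example pulse sequence and noise amplitude are hypothetical values used only for illustration.

```python
import numpy as np
from qutip import basis, sigmax, sigmaz

DT = np.pi / 10
target = basis(2, 0)

def fidelity_under_noise(initial_state, pulse_sequence, delta_charge=0.0, delta_nuclear=0.0):
    """Replay a pre-designed pulse sequence with quasi-static noise added to Hamiltonian (1).

    The noise amplitudes are held fixed over the whole episode, since they vary much
    more slowly than a single preparation task.
    """
    state = initial_state
    for J in pulse_sequence:
        H = (J + delta_charge) * sigmaz() + (1.0 + delta_nuclear) * sigmax()
        state = (-1j * H * DT).expm() * state
    return abs(target.overlap(state)) ** 2

# Hypothetical 6-step trajectory evaluated at charge-noise amplitude 0.05
print(fidelity_under_noise(basis(2, 1), [2, 0, 3, 1, 0, 2], delta_charge=0.05))
```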
The average fidelities of the |0⟩ and Bell-state preparations, obtained with control trajectories generated by our USP and averaged over all test points, are plotted versus the amplitudes of the two noises in Fig. 5(a) and (b), respectively. As can be seen from Fig. 5, the average fidelities do not change significantly and the control trajectories are robust against the considered imperfections within certain thresholds. We also find that, in the analyzed parameter windows, F under nuclear noise is always higher than that under charge noise of the same amplitude, for both the single- and two-qubit cases. This reveals that charge noise has the greatest impact on the preparation tasks.
A point worth noting is that the best average fidelity can even be obtained at non-zero nuclear noise in Fig. 5(b). That is to say, certain noises can help boost the fidelity through subtle adjustments of the parameters. A possible explanation is the restriction to discrete values in our calculation. We believe there is still room to achieve better performance by employing more allowed actions and a more deliberately chosen Zeeman energy spacing, just as these noises effectively do. Of course, more sufficient training of the Main Net is also helpful for enhancing the fidelity.
Given the limitations of presently accessible quantum computing hardware, we simulate quantum computing on a classical computer and generate data to train the network. Our algorithm is implemented with PYTHON 3.7.9, TensorFlow 2.6.0 and QuTiP 4.6.2 and has been run on a 4-core 1.80 GHz CPU with 8 GB memory. Details of the running environment of the algorithm can be found in Availability of data and materials. The runtime of the training process of the USP algorithm is about tens of seconds in the single-qubit case and about an hour in the two-qubit case.
Appendix

In RL, at each time step t the Environment provides the Agent with a state s; after the Agent selects and performs an action a_i chosen from the set of allowed actions a = {a_1, a_2, ..., a_n}, the Environment gives feedback to the Agent in the form of an immediate reward r. A Policy π specifies which action the Agent will choose in a given state, i.e., a_i = π(s). An episode is defined as the process in which the Agent starts from an initial state and continues until it completes the task or terminates halfway.
The total discounted reward R gained in an N-step episode can be written as [52]

R = Σ_{k=1}^{N} γ^{k-1} r_k,

where γ is a discount factor within the interval [0, 1], which indicates that the immediate reward r is discounted as the number of steps increases. The goal of the Agent is to maximize R, because a greater R implies a better performance of the Agent. Because of this discounting, the Agent naturally tends to obtain a large reward as quickly as possible to ensure a considerable R. To determine which action should be chosen in a given state, we introduce the action-value function, also named the Q-value [74]:

Q^π(s, a_i) = E[R | s, a_i, π].

The Q-value is the expectation of R that the Agent will obtain after executing an action a_i in a given state s under the policy π, and it can be obtained iteratively from the Q-values of the next state. Because there are multiple allowed actions that can be chosen in each state, and different actions lead to different next states, it is a time-consuming task to calculate Q-values in a multi-step process. To reduce the overhead, various algorithms are used to calculate approximations of this expectation, such as Q-learning [74] and SARSA [52].
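As a small worked example of the discounted return, the helper below sums γ^{k-1} r_k over one episode, matching the expression above; the reward values and γ are arbitrary illustrative numbers.

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted reward R = sum_k gamma**(k-1) * r_k over one episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Four-step episode with increasing fidelity-based rewards
print(discounted_return([0.2, 0.5, 0.9, 0.999]))  # ~ 0.2 + 0.45 + 0.729 + 0.728
```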
In Q-learning, the current Q(s, a_i) value is updated using the Q-value of the next state's "best action" [74]:

Q(s, a_i) ← Q(s, a_i) + α[r_t + γ max_{a'} Q(s', a') - Q(s, a_i)],   (7)

where α is the learning rate, which affects the convergence of this update. The part Q_target(s', a') = r_t + γ max_{a'} Q(s', a') is called the Q target value. All the Q-values of different states and actions can be recorded in a so-called Q-Table. With a precise Q-Table, it is easy to identify which action should be chosen in a given state. However, on the one hand, we need the best action to calculate the Q-value iteratively; on the other hand, we must know all the Q-values to determine which action is the best. To resolve this dilemma of "exploitation" versus "exploration", we adopt the ε-greedy strategy in choosing the action to execute, i.e., we choose the action corresponding to the current maximum Q-value with probability ε to calculate Q-values efficiently, or choose an action randomly with probability 1 - ε to expand the range of exploration. At the beginning, since it is not known which action is the best in a certain state, ε is set to 0 to explore as many states and actions as possible. When sufficient states and actions have been explored, ε gradually increases in increments of δ up to ε_max, which is slightly smaller than 1, so that the Q-values are calculated efficiently.
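A tabular sketch of this update rule and of the ε-greedy choice (exploit with probability ε, explore with probability 1 - ε, following the convention stated above) is given below; the state and action counts are toy values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    q_target = r + gamma * q_table[s_next].max()
    q_table[s, a] += alpha * (q_target - q_table[s, a])

def epsilon_greedy(q_table, s, epsilon):
    """Exploit the current best action with probability epsilon, otherwise explore randomly."""
    if rng.random() < epsilon:
        return int(q_table[s].argmax())
    return int(rng.integers(q_table.shape[1]))

# Toy Q-Table with 5 states and 4 actions
Q = np.zeros((5, 4))
q_learning_update(Q, s=0, a=2, r=1.0, s_next=1)
print(Q[0])                                # [0.  0.  0.1 0. ]
print(epsilon_greedy(Q, s=0, epsilon=0.9))
```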
For an Environment with a large or even infinite number of states, the Q-Table would be prohibitively large. To solve this "dimensional disaster", we can replace the table with a multi-layer neural network. After learning, the network is capable of assigning a suitable Q-value to each action when fed with a certain state. The deep Q network algorithm (DQN) [69,70] is based on Equation (7). One network, the Main Net θ, is used to predict the term Q(s, a_i), and another network, the Target Net θ⁻, is used to predict the term max_{a'} Q(s', a') in Equation (7). In order to ensure the ability to generalize, the data used to train the Main Net must satisfy the assumption of being independent and identically distributed, i.e., each sample of the dataset is independent of the others while the training and test sets are identically distributed. We therefore adopt the experience memory replay strategy [70]: the Agent obtains an experience unit (s, a, r, s') at each step. After numerous steps, the Agent will have collected many such units, which are stored in an Experience Memory D with capacity (memory size) M. In the training process, the Agent randomly samples a batch of N_bs experience units from the Experience Memory to train the Main Net at each time step. Note that, to ensure the stability of the algorithm, only the Main Net is trained at every time step, by minimizing the loss function

L(θ) = (1/N_bs) Σ_{j=1}^{N_bs} [r_j + γ max_{a'} Q(s'_j, a'; θ⁻) - Q(s_j, a_j; θ)]²,

where N_bs is the sample batch size, through the mini-batch gradient descent (MBGD) algorithm [37,69,70]. The Target Net θ⁻ is not updated in real time; instead, it copies the parameters from the Main Net θ every C steps. A schematic of this DQN algorithm is shown in Fig. 6.
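To illustrate how the Main Net, Target Net and Experience Memory fit together, here is a compact TensorFlow sketch of one training step; the state encoding (real and imaginary parts of the two amplitudes), the Adam optimizer, the learning rate and the memory capacity are assumptions for illustration, not the exact choices of Table 1.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

N_ACTIONS, STATE_DIM, GAMMA, N_BS = 4, 4, 0.9, 32   # assumed values

def build_net():
    """Two hidden layers of 32 neurons each, one output Q-value per allowed action."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(STATE_DIM,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(N_ACTIONS),
    ])

main_net, target_net = build_net(), build_net()
target_net.set_weights(main_net.get_weights())       # start from identical parameters
optimizer = tf.keras.optimizers.Adam(1e-3)
memory = deque(maxlen=10000)                          # Experience Memory D

def train_step():
    """Sample a mini-batch from D and update the Main Net only, per the loss above."""
    batch = random.sample(memory, N_BS)
    s, a, r, s_next = (np.array(x, dtype=np.float32) for x in zip(*batch))
    a = a.astype(np.int32)
    q_next = tf.reduce_max(target_net(s_next), axis=1)           # max_a' Q(s', a'; theta-)
    q_target = r + GAMMA * q_next
    with tf.GradientTape() as tape:
        q = tf.reduce_sum(main_net(s) * tf.one_hot(a, N_ACTIONS), axis=1)
        loss = tf.reduce_mean(tf.square(q_target - q))
    grads = tape.gradient(loss, main_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, main_net.trainable_variables))
    return float(loss)

# Every C training steps the Target Net copies the Main Net's parameters:
# target_net.set_weights(main_net.get_weights())
```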