From: Deep reinforcement learning for universal quantum state preparation via dynamic pulse control
Parameters \ Target state | \(\vert 0\rangle\) | Bell state |
---|---|---|
Allowed actions a (J(t)) | 0, 1, 2, 3 | a |
Size of the training set | 32 | 256 |
Size of the validation set | 32 | 256 |
Size of the test set | 64 | 6400 |
Batch size \(N_{bs}\) | 32 | 32 |
Memory size M | 20,000 | 40,000 |
Learning rate α | 0.01 | 0.0001 |
Replace period C | 200 | 200 |
Reward discount factor γ | 0.9 | 0.9 |
Number of hidden layers | 2 | 3 |
Neurons per hidden layer | 32/32 | 256/256/128 |
Activation function | ReLU | ReLU |
ϵ-greedy increment δϵ | 0.001 | 0.0001 |
Maximal ϵ in training \(\epsilon _{\max}\) | 0.95 | 0.95 |
ϵ in validation and testing | 1 | 1 |
\(F_{\mathrm{threshold}}\) per episode | 0.999 | 0.999 |
\(\mathrm{episode}_{\max}\) for training | 33 | 731 |
Total time T | 2π | 20π |
Action duration dt | π/10 | π/2 |
Maximum steps per episode | 20 | 40 |
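
The table is easiest to read as the configuration of a standard DQN with experience replay, a periodically replaced target network, and ϵ-greedy exploration. Below is a minimal sketch of how the Bell-state column could be wired up; PyTorch is an assumption (the table does not name a framework), and the state dimension, action count, and all class and function names are illustrative, while the numeric values are copied from the table. Note the table's convention that ϵ is the probability of taking the greedy action, which is why ϵ = 1 in validation and testing.

```python
# Sketch only, not the authors' code: numeric values come from the Bell-state
# column of the table; the state encoding, action count, and class design are
# assumptions made for illustration.
import random
from collections import deque

import torch
import torch.nn as nn

N_ACTIONS      = 4        # assumed; the Bell-state action entry is garbled in the source
STATE_DIM      = 8        # assumed: real and imaginary parts of a two-qubit state vector
BATCH_SIZE     = 32       # N_bs
MEMORY_SIZE    = 40_000   # replay memory size M
LEARNING_RATE  = 1e-4     # learning rate alpha
REPLACE_PERIOD = 200      # C: learning steps between target-network replacements
GAMMA          = 0.9      # reward discount factor
EPS_INCREMENT  = 1e-4     # delta-epsilon added per learning step
EPS_MAX        = 0.95     # maximal epsilon during training (epsilon = 1 in testing)
F_THRESHOLD    = 0.999    # fidelity at which an episode terminates successfully
MAX_STEPS      = 40       # maximum actions per episode (T = 20*pi with dt = pi/2)
EPISODE_MAX    = 731      # number of training episodes


def build_q_net() -> nn.Module:
    """Three hidden layers with 256/256/128 neurons and ReLU activations."""
    return nn.Sequential(
        nn.Linear(STATE_DIM, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, N_ACTIONS),
    )


class DQNAgent:
    """DQN with experience replay and a periodically synced target network."""

    def __init__(self) -> None:
        self.q_net = build_q_net()
        self.target_net = build_q_net()
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.optimizer = torch.optim.Adam(self.q_net.parameters(), lr=LEARNING_RATE)
        self.memory = deque(maxlen=MEMORY_SIZE)   # (s, a, r, s', done) tuples
        self.epsilon = 0.0                        # grows by EPS_INCREMENT up to EPS_MAX
        self.learn_steps = 0

    def act(self, state: torch.Tensor, greedy: bool = False) -> int:
        # In the table's convention, epsilon is the probability of taking the
        # greedy action, so epsilon = 1 in validation/testing (pure exploitation).
        eps = 1.0 if greedy else self.epsilon
        if random.random() < eps:
            with torch.no_grad():
                return int(self.q_net(state).argmax().item())
        return random.randrange(N_ACTIONS)

    def learn(self) -> None:
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        states, actions, rewards, next_states, dones = zip(*batch)
        s  = torch.stack(states)
        a  = torch.tensor(actions, dtype=torch.int64)
        r  = torch.tensor(rewards, dtype=torch.float32)
        s2 = torch.stack(next_states)
        d  = torch.tensor(dones, dtype=torch.float32)

        # Bellman target computed with the (frozen) target network.
        q = self.q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_target = r + GAMMA * self.target_net(s2).max(dim=1).values * (1.0 - d)
        loss = nn.functional.mse_loss(q, q_target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Replace the target network every REPLACE_PERIOD learning steps and
        # anneal epsilon toward EPS_MAX.
        self.learn_steps += 1
        if self.learn_steps % REPLACE_PERIOD == 0:
            self.target_net.load_state_dict(self.q_net.state_dict())
        self.epsilon = min(EPS_MAX, self.epsilon + EPS_INCREMENT)
```

For the \(\vert 0\rangle\) column, the same skeleton would use two hidden layers of 32 neurons, a replay memory of 20,000, a learning rate of 0.01, δϵ = 0.001, and at most 20 steps per episode (T = 2π with dt = π/10).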