Performance of the quantum kernel in the initial learning process

For many manufacturing companies, the production line is very important. In recent years, the number of small-quantity, high-mix products has been increasing, and the identification of good and defective products must be carried out efficiently. Machine learning for shipping inspection using small amounts of data is therefore a very important issue. Quantum machine learning is one of the most exciting prospective applications of quantum technologies. The SVM using kernel estimation is one of the most popular methods for classifiers. Our purpose is to search for a quantum advantage in classifiers that enables classification in inspection tests on small datasets. In this study, we clarified the difference between classical and quantum kernel learning in the initial state and propose an analysis of the learning process by plotting in ROC space. To meet this purpose, we investigated the effect of each feature map compared to the classical one, using evaluation indices. The simulation results show that the learning-model construction process differs between quantum and classical kernel learning in the initial state. Moreover, the results indicate that the learning model of the quantum kernel decreases the false positive rate (FPR) from a high FPR while keeping a high true positive rate (TPR) on several datasets. We demonstrate that the learning process of the quantum kernel differs from the classical one in the initial state and that plotting on an ROC space graph is effective when analysing the learning-model process.


Introduction
Quantum machine learning (QML) is one of the most exciting prospective applications of quantum technologies [1][2][3][4][5]. Kernel estimation is a method for estimating a whole distribution from a finite number of sample points and is a typical example of nonparametric estimation that cannot be expressed by parametric estimation. The inner product space is used for discrimination. Therefore, kernel estimation matches the mapping to the Hilbert space, and it is a promising method for the SVM as a classifier. The support vector machine (SVM) is one of the most often used methods in machine learning [6][7][8][9]. It is based on statistical machine learning, which allows the construction of training models with relatively little data. In recent years, the kernel estimation SVM has been widely used as one of the best methods [10][11][12]. The kernel SVM is widely used for pattern recognition and other imaging applications, as it can separate nonlinear feature spaces by using inner products.
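As a concrete illustration of kernel estimation for the SVM, the Gram matrix of the RBF kernel used later in this work, κ(x, x′) = exp(−γ‖x − x′‖²), can be computed as follows. This is a minimal NumPy sketch; the data points and γ value are illustrative, not taken from the paper.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    # RBF (Gaussian) kernel: exp(-gamma * ||x1 - x2||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

# Toy data: three samples with two features each (illustrative only)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Gram matrix passed to the SVM: K[i, j] = kappa(x_i, x_j)
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
```

The SVM then works entirely with this Gram matrix of inner products, which is why replacing its computation by a quantum circuit is possible without changing the classical training step.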
For many manufacturing companies, the production line is very important. In recent years, the number of small-quantity, high-mix products has been increasing, and the classification of good and defective products must be carried out efficiently. Classification targets include image data, text, and sound. Image classification is widely used in remote sensing [6], biological inspection [13][14][15], building and civil engineering [16] and manufacturing [17][18][19]. The inspection of defective products is a very important issue in the manufacturer's inspection process. The learning model of two-class classification is used in such inspection processes. Recently, the available training size (good and defective products) has been limited, as many products are produced in small quantities and many varieties. Therefore, we need a machine learning model that enables classification with limited, small data.
However, kernel estimation has two issues. One is the calculation cost: the cost becomes huge because the embedding function into the feature space grows dramatically as the feature volume increases. The other is the limitation of the embedding function: we must treat a complicated function when we use the kernel trick in the SVM. As a means of solving the above problems, there have been two attempts to use kernel estimation to embed feature maps with quantum entanglement in the Hilbert space. One is the quantum kernel SVM, which introduced Z-ZZ feature maps as quantum entanglement in an exponentially large feature space [20]. The other is the kernel estimation neural network, whose authors propose a methodology for assessing potential quantum advantage in learning tasks [21].
Our purpose is to obtain a highly accurate learning model that classifies imaging data with small training sizes in shipping inspection.
In this work, we investigated the difference between the classical and quantum learning processes. In Sect. 2, we describe related work using quantum kernel estimation. In Sect. 3, we explain the preparation of the datasets and the quantum circuits used in this work. In Sect. 4, we present simulation results obtained with a quantum simulator and an actual machine. First, in Sect. 4.1, we look into the effect of entanglement using the Pauli and Pauli-ZZ feature maps compared to classical kernel estimation. Second, in Sect. 4.2, we check the learning process using accuracy and F1-score as evaluation indices. Third, in Sect. 4.3, we propose plotting onto an ROC space graph using the confusion matrix. Fourth, in Sect. 4.4, we describe a first trial using our own product. Generally, the evaluation index AUC is used with ROC graphs [22][23][24][25]. Here, we use a new plotting method that differs from the method used with conventional ROC graphs. In Sect. 5, we discuss the meaning of plotting onto the ROC space graph. In Sect. 6, we conclude our work and describe the future outlook.

Related work
We describe two related works on quantum kernel estimation. One is the kernel estimation SVM, and the other is the kernel estimation neural network.
In 2019, two quantum algorithms on a 5-qubit superconducting processor were proposed and implemented experimentally to solve the cost issues described above [20]. To address the cost problem, the authors considered the utilization of an exponentially large quantum state space through controllable entanglement and interference.
One method implements a quantum variational classifier built as variational quantum circuits on the processor [26,27], and the other estimates the kernel function and optimizes the classifier directly by using a quantum kernel estimator [28]. They proposed the Z-ZZ feature map as the kernel function in the quantum circuits. This feature map uses a combination of the Pauli-Z feature map and the ZZ feature map as quantum entanglement.
A methodology for assessing potential quantum advantage in learning tasks was developed in 2021 [21]. The authors noted that, with the help of data, classical machine learning models can be competitive with quantum models even on problems tailored to quantum methods. The scheme is explained by a cartoon of the geometry (kernel function) defined by classical and quantum ML models.
They propose a projected quantum model, shown in the cartoon, that provides a simple and rigorous quantum speed-up for a learning problem in the fault-tolerant regime. For near-term implementations, they use an actual 30-qubit gate-based quantum computer to demonstrate quantum advantage.
With reference to the above research, we focus on feature maps with and without entanglement in quantum kernel circuit learning.

Preparation of datasets and circuits
The conventional datasets we used are Iris, heart disease, and wine. A summary of each dataset is shown in Table 1. Heart disease is a two-class classification dataset with 13 attributes. Wine is a three-class classification dataset with 13 attributes. Iris is a three-class classification dataset with 4 attributes. We created a two-class dataset, Iris_2, with 4 attributes from the original Iris dataset (which we call Iris_3); it consists of versicolor and virginica. Using these data, we can compare two-class with three-class classification at 4 and 13 attributes. Figure 1 shows the quantum circuit diagrams for the quantum kernel SVM. Figures (a), (b), (c) and (d) show the overall quantum circuit diagram and the detailed quantum circuit diagrams using the Pauli-Y feature map, the Pauli-Z feature map and the Y-ZZ feature map, respectively. Y-ZZ denotes a quantum entanglement feature map, as described later. Here, we use a classical-quantum hybrid system. We perform training and prediction on the classical SVM by using the Gram matrix calculated on the quantum circuit. The distance between the classical data x and x′ is calculated by the kernel κ(x, x′). By means of a nonlinear mapping ϕ(x) embedding the data into the quantum feature space, it can be expressed in the feature space as follows.
First, we prepare |ϕ(x)⟩ = S(x)|0⟩ as the data encoding from classical to quantum data (Eq. (1)). To obtain the inner product κ(x, x′), we prepare S(x′)†S(x)|0⟩ as the initial state of the quantum circuit. The probability of measuring |0⟩ on all qubits is

κ(x, x′) = |⟨0|S(x′)†S(x)|0⟩|²   (2)

Here, S(x) is the encoding circuit, and κ(x, x′) is the inner product between the quantum-encoded data obtained by quantum kernel estimation. Each feature map is embedded into this inner product to optimize the parameters. Each component of the Gram matrix is obtained from such an inner product. The parameters of the kernel estimation are optimized using rotation gates with or without entanglement in Eq. (2). We use rbf as the classical kernel of the SVM in this work.
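The fidelity kernel in Eq. (2) can be sketched with a small statevector simulation. The following is a minimal NumPy illustration, not the exact circuit used in this work: it encodes each feature as an RY rotation on its own qubit, i.e. a Pauli-Y-type feature map without entanglement, so |ϕ(x)⟩ is a product state and κ(x, x′) = |⟨ϕ(x′)|ϕ(x)⟩|².

```python
import numpy as np

def encode_pauli_y(x):
    # Product-state encoding: |phi(x)> = RY(x_1)|0> (x) ... (x) RY(x_d)|0>,
    # with single-qubit amplitudes RY(t)|0> = [cos(t/2), sin(t/2)] (real).
    state = np.array([1.0])
    for t in x:
        state = np.kron(state, np.array([np.cos(t / 2), np.sin(t / 2)]))
    return state

def quantum_kernel(x1, x2):
    # Fidelity kernel of Eq. (2): |<phi(x2)|phi(x1)>|^2
    return np.dot(encode_pauli_y(x1), encode_pauli_y(x2)) ** 2

# Toy Gram matrix for three 4-feature samples (illustrative values)
X = np.array([[0.1, 0.4, 0.7, 0.2],
              [0.5, 0.2, 0.1, 0.9],
              [0.9, 0.8, 0.3, 0.4]])
gram = np.array([[quantum_kernel(a, b) for b in X] for a in X])
```

The resulting Gram matrix is what the classical SVM consumes; an entangling feature map such as Y-ZZ would replace the product-state encoding with an entangling circuit, but the kernel definition is unchanged.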
A gate-based quantum simulator and quantum computer were used. The simulation was performed using IBM qiskit and confirmed with blueqat. The actual 5-qubit quantum computer used in this work was ibmq Bogota. The number of shots is 1024, and the seed is 10,598.
We checked the testing accuracy (accuracy) and F1-score as evaluation indices when the ratio of the training size was changed. Here, the training size started from 6 for Iris_2, 9 for Iris_3, 8 for heart disease, and 9 for wine. As mappings into the Hilbert space, the Pauli-X, -Y, and -Z feature maps and the X-ZZ, Y-ZZ and Z-ZZ feature maps with entanglement were used.

Effect of entanglement
To compare quantum kernels with and without entanglement, we embedded each feature map. Figure 2 shows the effect of each feature map on the quantum kernel SVM (qkSVM) for the heart disease and wine datasets. Here, we show a comparison between the classical kernel SVM (ckSVM) and qkSVM embedded with the Y, Z, X-ZZ, Y-ZZ, and Z-ZZ feature maps (qkSVM with Y, Z, X-ZZ, Y-ZZ, and Z-ZZ) on the heart disease and wine datasets with 13 attributes.
As the training size becomes larger, the accuracy increases. The accuracy exceeds 0.8 at a training size of 72 for ckSVM and for qkSVM with Y and Z. For all datasets, the accuracy of qkSVM with X-ZZ and Z-ZZ (with entanglement) was lower than that of qkSVM with Y-ZZ.

Figure 2: Effect of each feature map on training accuracy for the quantum kernel support vector machine (qkSVM), using the heart disease and wine datasets. Ac and TS stand for accuracy and training size. Classic, Y, Z, X-ZZ, Y-ZZ, and Z-ZZ stand for the respective feature maps. After measurement, the data are passed to classical machine learning.
When the training size for heart disease was 200, the accuracy of qkSVM with Y and Z was 0.835 and 0.845, respectively; that of qkSVM with Y-ZZ was 0.767; and that of ckSVM was 0.806. The values for qkSVM with X-ZZ and Z-ZZ were 0.621 and 0.680, respectively. When the training size for wine was 108, the accuracies of ckSVM and of qkSVM with Z and Y were 0.986, 0.971 and 0.986. These values are almost the same and are approximately double the accuracy of qkSVM with X-ZZ, Y-ZZ and Z-ZZ (0.371, 0.557 and 0.371).
From the above, introducing quantum entanglement is not effective for accuracy on the Iris, heart disease and wine datasets used in this work. On the other hand, qkSVM with Y and Z has the same or better performance than ckSVM. Moreover, the outline of the learning model is thought to be built in the range of training sizes less than 72.

Learning process
The confusion matrix is an important indicator for classification problems. Accuracy indicates how correct the predictions were overall. Precision indicates how many of the samples predicted to be positive actually were positive. Recall indicates how many of the actual positive samples could be predicted as positive. The F1-score is the harmonic mean of precision and recall. To analyze the learning-model process, it is better to compare accuracy with the F1-score. Figure 3 shows the relationship between the training size and each index on each dataset. Here, the training size is less than 100. On the Iris_2, Iris_3 and wine datasets, the evaluation indices (accuracy and F1-score) rise dramatically when the training size is less than 20. Moreover, the machine learning model using qkSVM shows higher accuracy and F1-score than that using ckSVM, except on heart disease. The values of accuracy and F1-score were almost the same when the training size was 20 or more. The difference between the accuracy and F1-score of qkSVM is in the order heart disease > Iris_2 > wine.
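The four indices above follow directly from the confusion-matrix counts TP, FN, FP and TN. A small sketch (the counts here are illustrative, not taken from the paper):

```python
# Confusion-matrix counts (illustrative two-class example)
TP, FN, FP, TN = 8, 2, 1, 9

accuracy = (TP + TN) / (TP + TN + FP + FN)    # overall correctness
precision = TP / (TP + FP)                    # correctness of positive predictions
recall = TP / (TP + FN)                       # coverage of actual positives
f1_score = 2 * precision * recall / (precision + recall)  # harmonic mean
```

Comparing accuracy with the F1-score, as done in this section, is informative precisely because the F1-score ignores TN and therefore reacts differently to class imbalance.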
We can use up to 5 qubits on an actual quantum computer (ibmq bogota). Under this limitation, we can calculate classifications with 4 attributes (feature volume). Experiments were carried out on Iris_2 and Iris_3. The shot number is 1024, and we used the average value over 10 runs. The index values obtained on the actual quantum computer almost coincide with the locus of the simulator.
In the two-class classification, the accuracy and F1-score of qkSVM on the Iris_2 dataset were almost the same when the training size was 20 or more. When the training size was 60, the accuracy and F1-score of qkSVM with Z became 1.000. On the heart disease dataset, the order of the indices was F1-score of qkSVM > accuracy and F1-score of ckSVM > accuracy of qkSVM when the training size was less than 60. The accuracy and F1-score of the quantum kernel remain large compared to the classical kernel at a training size of 240 (training size : testing size = 0.8 : 0.2). However, the order of the indices was accuracy and F1-score of qkSVM > accuracy and F1-score of ckSVM when the training size was 60 or more.
In the three-class classification, the accuracy and F1-score of qkSVM on the Iris_3 dataset were higher than those of ckSVM when the training size was less than 60. The difference in the indices between qkSVM and ckSVM decreased when the training size exceeded 60. On the wine dataset, qkSVM shows higher accuracy and F1-score than ckSVM when the training size is less than 20. However, each index of qkSVM and ckSVM becomes almost the same when the training size exceeds 20. Table 2 shows the training size at which the accuracy and F1-score reached 1.000 and, for cases in which these indices did not reach 1.000, the index values when the training size was 80% of the total data size. When the accuracy and F1-score reach 1.000, the training size on qkSVM is smaller than that on ckSVM. In the case of heart disease, these indices do not reach 1.000, and the indices of qkSVM are higher than those of ckSVM. The reason is that the number of attributes (feature volume) of the heart disease dataset is larger than that of the Iris_2 dataset, and the heart disease dataset, an actual problem, is complex compared to the wine dataset, a toy problem. From the above, the outline of the learning model is formed up to approximately 20 data points. After that, each parameter of the learning model is finely tuned by the Pauli feature map as the training size increases.
We found that the learning model on qkSVM is constructed with a smaller training size than that on ckSVM in the initial state. Moreover, we confirmed that the accuracy and F1-score of the quantum kernel are larger than those of the classical kernel.

Model construction process
To clarify the learning process in the initial state, we used an ROC space graph that differs from the conventional ROC curve. The ROC curve is used to analyze learning-model construction in classical machine learning by means of the true positive rate (TPR) and false positive rate (FPR). The TPR and FPR are obtained from the confusion matrix. Figure 4 shows the plots in ROC space. First, we plotted the FPR and TPR on heart disease in ROC space. In the case of ckSVM, the plotted point moves from near the random point toward the ideal corner as the training size increases.

We also measured the training accuracy, in addition to the testing accuracy, on heart disease. The results are shown in Table 3. The training sizes are 12, 72, and 120, the same settings used in the chart in Fig. 2. The training accuracy of ckSVM increases gradually as the training size becomes larger, as does the testing accuracy. In other words, the learning model is gradually constructed as the training size becomes larger. On the other hand, quantum kernel learning using qkSVM with the Y and Z feature maps showed a high training accuracy of around 1 at training sizes of 12, 72 and 200. The high training accuracy was maintained even as the training size became larger. This trend differed from that of the testing accuracy. This result indicates that, in the case of qkSVM, the construction of the learning model is completed within the data range used for training.
As described above, we found that the learning process of ckSVM differs from that of qkSVM. The learning process of qkSVM always maintains a higher TPR, and the learning model starts with both a high TPR and a high FPR in the initial state. The learning model on qkSVM is constructed by keeping the TPR at almost 1 while decreasing the FPR.
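The ROC-space coordinates used above follow from the same confusion-matrix counts: TPR = TP/(TP + FN) and FPR = FP/(FP + TN), and each trained model contributes one (FPR, TPR) point. A minimal sketch (the counts are made up for illustration):

```python
def roc_point(tp, fn, fp, tn):
    # One (FPR, TPR) point in ROC space for a trained classifier
    tpr = tp / (tp + fn)  # true positive rate (recall)
    fpr = fp / (fp + tn)  # false positive rate
    return fpr, tpr

# Illustrative qkSVM-like trajectory: TPR stays high while FPR decreases
early = roc_point(tp=10, fn=0, fp=8, tn=2)  # initial state: high TPR, high FPR
late = roc_point(tp=10, fn=0, fp=1, tn=9)   # later: high TPR, low FPR
```

Plotting one such point per training size, rather than sweeping a decision threshold as in a conventional ROC curve, is what produces the trajectories discussed in this section.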

First trial on actual products
So far, we have run simulations on balanced toy datasets. Real products yield imbalanced datasets. Table 4 shows the results of testing accuracy applied to defect detection of industrial products in our factories.
Although the number of original image data exceeds 10,000, the defective product rate is less than 1%. Of these, 400 samples (300 good products and 100 defective products) were extracted. Image processing was performed as preprocessing for machine learning. Then, we selected 10 features by using principal component analysis (PCA) in classical machine learning. Here, the cumulative contribution of the PCA is greater than 80% when the number of attributes (feature volume) is 10. After that, we performed classification with ckSVM and qkSVM. From Table 4, we confirmed that the plots of (FPR, TPR) lie almost on the classical and quantum trajectories, respectively, as shown in Fig. 4. We have just started this trial of detecting defective products in factory products. From now on, we would like to accumulate data and obtain reliable knowledge.

Table 4: Testing accuracy applied to defect detection of an industrial product. Prod. and (Q)ML stand for industrial product and (quantum) machine learning. The defective product rate is less than 1% for products under development. ⇒ means preprocessing. After preprocessing, we select 10 features by principal component analysis (PCA). C and Z stand for ckSVM and qkSVM using the Z feature map.
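The feature-selection step described above, keeping the smallest number of principal components whose cumulative contribution exceeds 80%, can be sketched as follows. The data here are random placeholders standing in for the preprocessed image features, not the factory data.

```python
import numpy as np

def pca_reduce(X, threshold=0.8):
    # Center the data and take the SVD; the singular values give the
    # explained-variance (contribution) ratio of each principal component.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(ratio)
    # Keep the smallest k whose cumulative contribution >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return Xc @ Vt[:k].T, cumulative[k - 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 32))   # placeholder for 400 preprocessed samples
X_reduced, contribution = pca_reduce(X)
```

In this work the threshold of 80% happened to be met at 10 components; for other data the chosen k would differ.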

Discussion
The calculations on quantum computers are characterized by quantum operations in quantum circuits. A quantum operation starts with the encoding from classical data to quantum data. The encoding is performed by projection onto the Hilbert space; this projection represents a mapping. The encoding from classical to quantum data corresponds to |ϕ(x)⟩ = S(x)|0⟩, as shown in Eq. (1). Figure 5 shows our hypothesis for the quantum and classical learning processes in ROC space. The dotted green line stands for a random learning model. As learning progresses, the ROC curve becomes the dashed green line when the AUC exceeds 0.8, and the ideal learning model becomes the continuous green solid line. When we calculate the FPR and TPR from TP, FN, FP and TN, the random learning model lies at FPR = 0.5 and TPR = 0.5, and the ideal learning model is the red filled star. The results obtained in Sect. 4 mean that the true positive rate becomes large and the false positive rate decreases (green arrow in the figure) as learning progresses. Therefore, our result for the classical simulation is reasonable.
We observed that when quantum kernels are used, the learning process starts with a high TPR and a high FPR. Keeping a high TPR, the FPR becomes small as learning progresses (orange arrow in the figure). Our interpretation is as follows. Classical data are transformed into quantum data and embedded in a Hilbert space. Then, the learning model is completed within the randomly set training data range, as can be inferred from the results in Table 3. At the start of learning, the training data are randomly and sparsely scattered. Therefore, model building in quantum kernel learning can be considered to have a wide tolerance. As a result, the model is likely to exhibit a high TPR and FPR in the initial learning process. Then, as training progresses, the density of the data space increases, so the learning model is expected to become less tolerant and have a lower FPR.
Let us think about quantum operations in terms of a maze. There are various routes in the maze. Upon entering the entrance, all routes, including the route to the correct exit, are listed as candidates at the same time; this corresponds to superposition. After that, the routes bifurcate and interfere. As a result, the incorrect routes are weakened and the correct routes are strengthened, so the correct routes acquire the highest probability. Finally, the quantum state collapses on measurement. The above description is an image of quantum calculation.
We think that the fact that the TPR and FPR start near 1 means that all cases are candidates at the same time, so learning starts from the superposition state.

Conclusion and outlook
We investigated the difference between classical and quantum kernel learning by using several evaluation indices. The simulation results yielded several suggestions. In general, it is said that quantum entanglement contributes to improved classification accuracy because more feature maps can be embedded in the Hilbert space. However, we could not show an effect of quantum entanglement on the datasets we selected. Our results suggest that the quantum learning-model building process differs from the classical one on these datasets. From these results, we plotted onto ROC space according to training size. As a result, we could recognize the difference between classical and quantum kernel learning in the initial state. Therefore, we propose the utilization of an ROC space graph to investigate the initial learning process. We clarified the initial behavior of the quantum learning process from the ROC space graph.
Regarding the recall rate (TPR): we want to avoid the risk of erroneously predicting a defective product (positive) as a good product (negative) and to classify cases suspected of being defective (positive) without omission. A high TPR is important in such cases. Plotting in the ROC space is also important from this point of view. Our purpose is to efficiently detect defective products in high-mix, low-volume production. We believe that this plot can be an evaluation tool suitable for this purpose.
Quantum computers and classical computers are expected to coexist in the future. It is important to distinguish between computations that quantum computers are good at and those that classical computers are good at. When we consider the implementation of quantum machine learning in factories, it is necessary to select calculations that quantum computers are good at. To do so, we need to accumulate results calculated by quantum computers to determine which computations they excel at. From now on, we would like to build a useful classifier through such trials.