As the EDA highlights a clustering effect, we propose clustering approaches that rely on different data for angle recommendation. Namely, we first describe using the angle values directly to build clusters that provide the angles to try. Then, we switch to using instance-related features. Finally, for the unweighted case, we use graph auto-encoders, whose outputs can be used for clustering in place of computed graph features. In the following, we detail each clustering approach for flexible angle recommendation.

### 4.1 Identifying clusters of angles or problem instances

We first considered clustering using angle values. Given a database of optimal angles \(\{ (\gamma ^{\ast} , \beta ^{\ast})_{1}, \ldots ,(\gamma ^{\ast} , \beta ^{\ast})_{Q} \}\) for *Q* problem instances \(\{ I_{1}, \ldots , I_{Q} \}\), this can be seen as computing or selecting from the database a good set of angle values to apply to new instances. In this case, the problem instances themselves are not used during clustering. Given a user-specified number *K* of angles to be tested, this set of angle values is then applied to new QAOA circuits. To obtain it, we can run a clustering algorithm on the database of optimal angles. For instance, K-means [25] outputs centroids that can be used directly as angle recommendations for QAOA on new instances. The K-means algorithm aims to partition a set of *n* data points \(x_{i}\) into *K* disjoint clusters, each characterized by the mean (centroid) of the points it contains, denoted \(\mu _{j}\). The partition \(P = \{P_{1}, P_{2}, \ldots , P_{K}\}\) (with \(P_{i} \neq \emptyset \) and \(P_{i} \cap P_{j}=\emptyset \) for all \(i\neq j \in [1..K]\), and \(\cup _{i} P_{i} = \{x_{i}\}_{i=1}^{n}\)) is chosen by minimizing the within-cluster sum of squares, i.e., \(\operatorname{arg\,min}_{P}\sum_{i=1}^{K}\sum_{x\in P_{i}}||x - \mu _{i}||^{2}\), where the centroid is \(\mu _{i} = |P_{i}|^{-1}\sum_{x \in P_{i}}x\). The algorithm alternates between assigning each data point to its nearest centroid and recomputing each centroid as the mean of its assigned points, until convergence.
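The assignment and update steps described above can be sketched in a few lines of plain Python. This is only an illustration, not the implementation used for the experiments: the function names, the toy angle database, and the deterministic farthest-point initialization are assumptions made here so that the example is reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns (centroids, labels)."""
    # Farthest-point initialization: an assumption of this sketch, chosen
    # because it is deterministic (no random seed is involved).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy database of optimal (gamma, beta) pairs; the values are made up.
angles = [(0.10, 0.20), (0.12, 0.22), (0.90, 0.80),
          (0.88, 0.82), (0.50, 1.50), (0.52, 1.48)]
centroids, labels = kmeans(angles, k=3)
```

The returned `centroids` play the role of the *K* angle recommendations. For depths larger than one, each data point would simply be a longer tuple of angles.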

To incorporate knowledge from instances when recommending angles, we change the data fed to the clustering algorithm. We distinguish computing instance features from learning an embedding, that is, a representation or encoding of the instances of user-defined dimension *F*. We denote the encoding of an instance \(I_{t}\) as \(f(I_{t})\). The angle recommendation framework using a clustering algorithm on such instance representations is presented in Algorithm 1. First, clusters are learned from the encodings extracted from the training data. Then, we find the instances in the database that are closest in distance to the cluster centroids and retrieve their corresponding optimal angles.^{Footnote 1} These angles are then used for QAOA circuits on new instances, from which we keep the best QAOA output.
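The recommendation step of this framework can be sketched as follows. The sketch assumes that cluster centroids in encoding space are already available (for example from K-means) and that a callable `evaluate` returns the QAOA expected cost for a given set of angles, with higher values being better. Both function names, the toy data, and the stand-in evaluator are hypothetical.

```python
def recommend_angles(centroids, encodings, optimal_angles):
    """For each centroid, return the optimal angles of the closest database instance."""
    recommendations = []
    for c in centroids:
        nearest = min(
            range(len(encodings)),
            key=lambda t: sum((a - b) ** 2 for a, b in zip(encodings[t], c)),
        )
        recommendations.append(optimal_angles[nearest])
    return recommendations


def best_qaoa_output(candidate_angles, evaluate):
    """Call the (hypothetical) evaluator once per candidate and keep the best angles."""
    return max(candidate_angles, key=evaluate)


# Toy database: encodings f(I_t) and the matching optimal angles (made-up values).
encodings = [(0.0, 1.0), (0.1, 0.9), (2.0, 2.0), (2.1, 1.9)]
optimal_angles = [(0.30, 0.10), (0.31, 0.12), (0.70, 0.40), (0.72, 0.38)]
centroids = [(0.02, 0.98), (2.08, 1.92)]

candidates = recommend_angles(centroids, encodings, optimal_angles)

# Stand-in for a circuit evaluation on a new instance (not a real QAOA run).
toy_evaluate = lambda ang: -((ang[0] - 0.7) ** 2 + (ang[1] - 0.4) ** 2)
best = best_qaoa_output(candidates, toy_evaluate)
```

The number of circuit evaluations equals the number of centroids, which is what keeps the budget of quantum circuit calls small.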

### 4.2 Instance encodings

In this work, we show two main approaches to encoding the instances for clustering. First, we computed a set of features following [17, 26]. Such features were used in [26] to decide among classical heuristics to solve MaxCut and QUBO problems. Inspired by [26], the features were also used for choosing when to apply QAOA against a classical approximation algorithm [17]. *For Erdős-Rényi graphs, we took the graph density, the logarithm of the number of nodes and edges, the logarithm of the first and second-largest eigenvalues of the Laplacian matrix normalized by the average node degree and the logarithm of the ratio of the two largest eigenvalues. For QUBOs, we reduced them to the MaxCut formulation and used the logarithm of the number of nodes, and the weighted Laplacian matrix eigenvalues-based features.*
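As an illustration of the Erdős-Rényi feature set listed above, the following sketch computes the six quantities from an adjacency matrix with NumPy. The function and key names are assumptions, and so is the reading that the two largest Laplacian eigenvalues are divided by the average node degree before taking the logarithm, with the ratio taken as largest over second-largest. Graphs whose second-largest eigenvalue is zero would need special handling that is not shown.

```python
import numpy as np


def erdos_renyi_features(adj):
    """Graph features from a symmetric 0/1 adjacency matrix without self-loops."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    m = adj.sum() / 2                      # each undirected edge is counted twice
    degrees = adj.sum(axis=1)
    laplacian = np.diag(degrees) - adj
    eigvals = np.linalg.eigvalsh(laplacian)  # eigenvalues in ascending order
    lam1 = eigvals[-1] / degrees.mean()    # largest, normalized by average degree
    lam2 = eigvals[-2] / degrees.mean()    # second largest, same normalization
    return {
        "density": 2 * m / (n * (n - 1)),
        "log_nodes": np.log(n),
        "log_edges": np.log(m),
        "log_lambda1": np.log(lam1),
        "log_lambda2": np.log(lam2),
        "log_ratio": np.log(lam1 / lam2),
    }


# 4-cycle: Laplacian eigenvalues are 0, 2, 2, 4 and every node has degree 2.
cycle4 = [[0, 1, 0, 1],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [1, 0, 1, 0]]
features = erdos_renyi_features(cycle4)
```

Stacking such feature vectors over the database gives the data matrix that is passed to the clustering algorithm in place of the angle values.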

We also show how to use graph embeddings learned by Graph Neural Networks (GNNs) [27], which spares the user from computing the features. We employ Variational Graph Auto-Encoders (VGAE) [28]. By design, this technique only works on unweighted graphs. Consequently, we only applied it to the MaxCut instances later in this work. A VGAE learns latent embeddings \(\mathbf{Z} \in \mathbb{R}^{N\times F}\), where *F* is the dimension of the latent variables and *N* the number of nodes. Given the adjacency matrix *A* and the node features *X*, the model outputs the parameters *μ*, *σ* of a Gaussian distribution from which the latent representation is generated. We feed the Erdős-Rényi graphs to the model, using the degree of each node as its node feature. Once learning is completed, we compute the graph embeddings by a common average readout operation [27, 29]. For a graph with vertex set \(\mathcal{V}\), this operation averages the node embeddings: \(\frac{1}{|\mathcal{V}|}\sum_{n\in \mathcal{V}}Z_{n}\). This yields an encoding of fixed dimension *F* that can be used by a clustering algorithm.
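The average readout can be illustrated with a minimal sketch that takes the node embeddings as a list of rows of \(\mathbf{Z}\). The function name and the numbers are made up. The point is that graphs with different numbers of nodes are mapped to vectors of the same length *F*.

```python
def average_readout(node_embeddings):
    """Average N node embeddings (each of length F) into one F-dimensional vector."""
    n = len(node_embeddings)
    return [sum(column) / n for column in zip(*node_embeddings)]


# Two graphs with different numbers of nodes but the same latent dimension F = 2.
z_small = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                # N = 3
z_large = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]    # N = 4

enc_small = average_readout(z_small)
enc_large = average_readout(z_large)
```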

Having defined different strategies for clustering, we apply them to the data we generated and compare their performances. In the following section, we present our results obtained by taking a Machine Learning approach, starting from a simple baseline and cross-validating each method.

### 4.3 Results

In this section, we apply the strategies proposed above to the generated data, for which the EDA revealed different areas of concentration. As a first baseline for the angle-setting strategy, we experiment with simple aggregation of the angle values (median and average). We then use K-means as the underlying clustering algorithm, varying the number of clusters from 3 to 10. Finally, we change the data given to K-means so as to cluster on instance encodings instead of angle values. We first computed a set of graph features that were used in a previous study [30]. We then investigate graph auto-encoders to learn the encodings of the MaxCut instances. We evaluate each method using 5-fold cross-validation, reporting the ratios \(\frac{(C_{\mathrm{opt}} - E_{\gamma ,\beta} (C))}{(C_{\mathrm{opt}}- E^{\mathit{cluster}}_{\gamma ,\beta} (C))} \) on test instances. A value higher than 1 means that the average cost yielded by clustering improves over the one found by optimization. We also consider the case where one trains on smaller instances and applies the result to bigger ones.
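For concreteness, the reported ratio can be computed as in the sketch below. The function name and the numerical values are made up, and the handling of a zero denominator is a convention of this sketch only, not something specified in the text.

```python
def performance_ratio(c_opt, e_optimized, e_cluster):
    """Ratio (C_opt - E_optimized) / (C_opt - E_cluster) for one test instance."""
    gap_optimized = c_opt - e_optimized
    gap_cluster = c_opt - e_cluster
    if gap_cluster == 0:
        # Convention of this sketch: the clustered angles reach the optimum.
        return 1.0 if gap_optimized == 0 else float("inf")
    return gap_optimized / gap_cluster


# Clustered angles slightly worse than the optimized ones: ratio below 1.
r_worse = performance_ratio(c_opt=10.0, e_optimized=8.0, e_cluster=7.5)
# Clustered angles better than the optimized ones: ratio above 1.
r_better = performance_ratio(c_opt=10.0, e_optimized=8.0, e_cluster=9.0)
```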

#### 4.3.1 From angle values

As a simple baseline, we compute the average and the median of the optimal angles from the database \(\{ (\gamma ^{\ast} , \beta ^{\ast})_{1}, \ldots ,(\gamma ^{\ast} , \beta ^{\ast})_{Q} \}\). Aggregated over depths, averaging the angle values yielded a median ratio of 0.524 for MaxCut and 0.672 for QUBOs, while taking the median values increased it to 0.950 and 0.941, respectively. This can be explained by the median being statistically more robust than the mean on data sets with large variability.
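The robustness argument can be seen on a toy example. The sketch below aggregates a made-up angle database component-wise; the component-wise treatment and the helper name `aggregate` are assumptions of this illustration. A single distant point drags the mean away from the dense group, while the median stays inside it.

```python
from statistics import mean, median


def aggregate(angles, how):
    """Apply an aggregation function to each angle component separately."""
    return tuple(how(component) for component in zip(*angles))


# Four nearby (gamma, beta) pairs and one distant pair (made-up values).
angles = [(0.20, 0.50), (0.21, 0.52), (0.22, 0.51), (0.23, 0.49), (1.40, 0.10)]

mean_angles = aggregate(angles, mean)      # pulled toward the distant pair
median_angles = aggregate(angles, median)  # remains within the dense group
```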

As expected with K-means, increasing the number of clusters yielded better median ratios. With \(K=10\), the median ratios are 0.998 and 0.985 on the two datasets, i.e., a reduction of less than 2% in performance w.r.t. the optimal angles. Figure 4 shows the improvement as the number of clusters increases. We also observe that the median ratio decreases with increased depth. We conjecture that, as the dimension of the parameter space increases, more clusters are naturally needed to ensure a sensible recommendation.

Also, this deterioration of performance w.r.t. circuit depth is more substantial on the QUBO instances than on the MaxCut ones, which can be explained by the clustering patterns in the MaxCut scenario being more pronounced and regular (Fig. 3). In addition, this observation suggests that, for dense QUBO instances where the cluster center is not representative of all the points belonging to it, it would be more reasonable to use a supervised learning method that takes the problem instance as input and predicts the optimal angle values; we leave this for future work.

We also observed that, for the MaxCut problem, the K-means cluster centroids can be quite distant from the data points when the number of clusters is small and the circuit depth is high. In particular, this phenomenon degrades the median ratio by ca. 30% for 3 and 4 clusters with \(p=3\). Hence, we decided to take the data point closest to the centroid in each cluster as the recommendation, which solves this issue. For QUBOs, using the cluster centroids directly yields better results.

Overall, increasing the number of angles attempted improves the quality of the QAOA output. The results with fewer than 4 clusters include cases where the ratio is low, which worsens the median performance. For instance, with 3 clusters on QUBOs, the median ratio is 0.915. When the budget of quantum circuit calls is very limited, this could be problematic and calls for more robust approaches. To this end, we consider using instance features for clustering.

#### 4.3.2 From instance encodings

To assess whether using instance features can improve the quality of clustering, we divided the ratios obtained with instance features by the ones obtained using angle values. We show these results in Figs. 5 and 6, where we can clearly see better ratios with fewer than 4 clusters, and similar results on average otherwise.

As for learned encodings (embeddings) with auto-encoders, the GNN model we use has the same two-layer graph convolutional architecture as [28]. Namely, the first layer has a 32-dimensional output and uses the ReLU activation function. It is followed by two 16-dimensional output layers that parameterize the latent variables. We train using Adam with a learning rate of 0.01 for 100 epochs, with the batch size set to the dataset size. Our implementation uses the Deep Graph Library (DGL) [29]. The embeddings obtained by averaging are of dimension \(F=16\), which provides a fixed-dimension encoding as input to the same K-means strategy described above. We observe in Fig. 7 that the results are similar to the ones obtained using instance features. Yet, on some instances, we see better results. Hence, several clustering results can be combined to improve the ratios, each compensating for the others' weaknesses, at the cost of trying more angles to find the best ones. As future work, we could also use an ML model to decide which heuristic to apply to a given test instance.
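For reference, the encoder described in this paragraph can be written out as follows, using the formulation of [28] as we understand it. The symbols \(W_{0}\), \(W_{\mu}\), \(W_{\sigma}\), \(\tilde{A}\), \(D\) and the input feature dimension \(d\) are notation introduced here for illustration; the shapes follow the 32- and 16-dimensional outputs stated above, and details such as the exact normalization used in the DGL implementation may differ.

```latex
% Shared first layer followed by two output heads (mean and log-standard deviation)
\tilde{A} = D^{-1/2} A D^{-1/2}, \qquad
H = \mathrm{ReLU}\bigl(\tilde{A} X W_{0}\bigr), \qquad W_{0} \in \mathbb{R}^{d \times 32},

\mu = \tilde{A} H W_{\mu}, \qquad
\log \sigma = \tilde{A} H W_{\sigma}, \qquad
W_{\mu}, W_{\sigma} \in \mathbb{R}^{32 \times 16},

% Gaussian latent distribution over the N node embeddings
q(\mathbf{Z} \mid X, A) = \prod_{i=1}^{N} \mathcal{N}\bigl(z_{i} \mid \mu_{i}, \operatorname{diag}(\sigma_{i}^{2})\bigr).
```

With node degrees as the only node feature, \(d = 1\), and each row of \(\mathbf{Z}\) has dimension \(F = 16\) before the averaging readout.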

Finally, our approaches can save numerous circuit calls compared to de novo optimization. The median numbers of circuit calls for the BFGS runs giving the best QAOA angles were 56, 150, and 320 for each depth, respectively, on MaxCut and 44, 132, and 252 on QUBO, whereas in the clustering approach the number of calls always equals the number of clusters, which is considerably smaller than the cost of BFGS. Instance size does not seem to affect the number of circuit calls made by BFGS. In our approaches, we limited circuit calls to 10, and we do not need multiple restarts.
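As a quick worked check of these numbers, the snippet below divides the reported median BFGS call counts by the maximum budget of 10 calls used in the clustering approach. The variable names are illustrative, and the comparison only covers the single best BFGS run, not any restarts.

```python
# Median circuit calls of the best BFGS runs, listed per depth as reported above.
bfgs_calls = {"MaxCut": [56, 150, 320], "QUBO": [44, 132, 252]}
cluster_calls = 10  # largest number of clusters (and hence calls) considered

# Factor by which the number of circuit calls is reduced, per dataset and depth.
savings = {
    name: [calls / cluster_calls for calls in per_depth]
    for name, per_depth in bfgs_calls.items()
}
```

Even at the lowest depth, the clustering approach uses several times fewer calls, and the factor grows with depth because the BFGS cost increases while the clustering budget stays fixed.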

### 4.4 Aggregating results

Following the presentation of the different clustering approaches, we compare their performances to determine which approach works best. We propose to use the empirical cumulative distribution function (ECDF) of the ratios as the performance measure for this comparison. Given a sample \(\{r_{i}\}_{i=1}^{R}\) of the ratios and a value of interest \(t\in [0, 1]\), the ECDF is the fraction of the sample points less than or equal to *t*: \(F(t) = \frac{1}{R} \sum_{i=1}^{R} \mathbf{1}_{[r_{i}, \infty )}(t)\), where **1** denotes the indicator function, which returns one if \(t \geq r_{i}\) and zero otherwise. ECDFs enable us to aggregate the results over the different numbers of clusters and depths. A better method yields a larger proportion of high ratios, resulting in an ECDF curve located further to the right. From Fig. 8, we observe that *using instance encodings is more successful in yielding better angles than using the angle values*. This is also visible in Fig. 9 with increased depth and a low number of clusters. Also, VGAE seems to be slightly better than instance features on the MaxCut problems. However, these methods can complement each other, especially as we do not need to increase the dataset size. Hence, combining them at the cost of additional circuit calls becomes an option for running QAOA, as we showcase with RQAOA in the next section.
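The ECDF, taken as the fraction of sample points less than or equal to *t*, can be computed as in the sketch below. The two samples of ratios are made up and only illustrate how a method with higher ratios yields a curve shifted to the right, that is, a lower ECDF value at the same *t*.

```python
def ecdf(ratios, t):
    """Fraction of the sampled ratios that are less than or equal to t."""
    return sum(r <= t for r in ratios) / len(ratios)


# Made-up ratio samples for two hypothetical methods.
ratios_a = [0.60, 0.80, 0.90, 0.95]
ratios_b = [0.90, 0.95, 0.98, 0.99]

f_a = ecdf(ratios_a, 0.9)  # most of method A's ratios lie at or below 0.9
f_b = ecdf(ratios_b, 0.9)  # only a small share of method B's ratios do
```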

### 4.5 Case when test instances are bigger than training instances

An important consideration for these methods is how they scale. This is relevant in settings where one is interested in solving larger instances given smaller ones. In our case, we apply these approaches with \(K=3\) and a 60–40% train-test split. From Figs. 10 and 11, we reach similar conclusions, with VGAE on MaxCut and instance features on the QUBO problems, respectively, yielding better results. Note that we did not use the logarithm of the number of nodes and edges as features when using instance features, as their values differ too much between training and test instances.
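One way to build such a split is sketched below, under the assumption that instances are ordered by size so that the smallest 60% are used for training and the largest 40% for testing. The exact splitting procedure is not detailed here, so the function name, the ordering rule, and the toy data are hypothetical.

```python
def split_by_size(instances, sizes, train_fraction=0.6):
    """Put the smallest instances in the training set and the largest in the test set."""
    order = sorted(range(len(instances)), key=lambda i: sizes[i])
    cut = int(round(train_fraction * len(instances)))
    train = [instances[i] for i in order[:cut]]
    test = [instances[i] for i in order[cut:]]
    return train, test


# Toy instance labels with their numbers of nodes (made-up values).
instances = ["g1", "g2", "g3", "g4", "g5"]
sizes = [12, 8, 20, 10, 16]

train, test = split_by_size(instances, sizes)
```

With such a split, features that depend directly on the instance size take different ranges in the two sets, which is consistent with leaving out the logarithm of the number of nodes and edges.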