Fock State-enhanced Expressivity of Quantum Machine Learning Models

The data-embedding process is one of the bottlenecks of quantum machine learning, potentially negating any quantum speedups. In light of this, more effective data-encoding strategies are necessary. We propose a photonic-based bosonic data-encoding scheme that embeds classical data points using fewer encoding layers, circumventing the need for nonlinear optical components by mapping the data points into the high-dimensional Fock space. The expressive power of the circuit can be controlled via the number of input photons. Our work sheds light on the unique advantages offered by quantum photonics for the expressive power of quantum machine learning models. By leveraging the photon-number dependent expressive power, we propose three different noisy intermediate-scale quantum-compatible binary classification methods with different scalings of required resources, suitable for different supervised classification tasks.


I. INTRODUCTION
Machine learning approaches such as artificial neural networks are powerful tools for solving a wide range of problems including image classification and regression. However, the scalability of machine learning implemented using general-purpose electronic circuits is limited by their high power consumption and the end of Moore's law. These issues motivate the pursuit of dedicated hardware for machine learning including photonic neural networks [1][2][3][4] and quantum circuits [5][6][7][8][9][10][11].
The combination of ideas from the photonic and quantum machine learning communities may enable further speed-ups and novel functionalities [12][13][14][15][16][17][18][19]. For example, both classical and quantum photonic neural networks are presently limited by the difficulty of incorporating nonlinear activation functions. This challenge can be circumvented using the kernel trick, in which the input data is mapped into a high-dimensional feature space where simple linear models become effective [13,[20][21][22]. The simplest quantum feature map based on repeated application of data-dependent single qubit rotations is already sufficient to serve as a universal function approximator [23][24][25].
Despite progress in various aspects of near-term quantum machine learning algorithms [26], including experimental realizations [16,27-30], proposals for various platforms [28,31] and studies of the statistical properties of quantum machine learning models [32-34], the encoding of input data is still a significant bottleneck for (quantum) photonic machine learning hardware. For example, the expressive power of quantum circuits based on parameterized single-qubit rotations is limited by the number of encoding gates used [23,24]. Similarly, some existing quantum machine learning algorithms with proven [...]

[FIG. 1. Circuit diagram of a parameterized linear quantum photonic circuit with m spatial modes, encoding data x using a single phase shifter. The expectation value with respect to observables of photon-number-resolving (PNR) or threshold detectors can be written as a Fourier series.]

[...] circuit blocks with one data encoding block sandwiched between them. We show that for a fixed number of encoding phase shifters, the expressive power of the parameterized quantum circuit is improved by embedding the classical data into the higher-dimensional Fock space. This enables the approximation of classical functions using fewer encoding layers while circumventing the need for nonlinear components. The origin of this improved encoding efficiency is that each phase shifter uploads the input data onto multiple Fock basis states simultaneously.
Similar to Ref. [24], n-photon quantum machine learning models can be expressed as a Fourier series

f^(n)(x, Θ, λ) = Σ_{ω ∈ Ω_n} c_ω e^{iωx},

where Ω_n ⊂ Z is the frequency spectrum and {c_ω} are the Fourier coefficients that depend on the trainable circuit blocks' parameters Θ = (θ_1, θ_2) and the observable's parameters λ. The expressive power of the Fourier series is determined by two components: the spectrum of frequencies ω, and the Fourier coefficients c_ω. We show that the frequency spectrum of the circuit can be controlled by the number of input photons. Thus, a rich frequency spectrum can be generated by providing a sufficient number of input photons to linear QPCs with a constant number of spatial modes. In contrast, qubit-based circuits require deeper or wider circuits to increase the size of their frequency spectrum. When generalized to arbitrary input states and observables, the QPCs can also generate arbitrary sets of Fourier coefficients that combine the frequency-dependent basis functions e^{iωx}, allowing them to approximate any square-integrable function on a finite interval to arbitrary precision [24,48,49].
As an application of the parameterized linear quantum photonic circuits, we consider three different machine learning approaches for supervised data classification: (1) A variational classifier based on minimizing a cost function by training the circuit parameters. (2) Kernel methods, which employ fixed circuits, with training carried out on observables only. (3) Random kitchen sinks, which use a set of random circuits to approximate a desired kernel function. Each of these methods has different scaling with the dimension of the data and number of training points used, and so each is better-suited to different types of supervised learning problems.
The outline of this paper is as follows. Sec. II introduces our proposed linear quantum photonic circuit architecture and analyzes how its expressive power depends on the number of spatial modes and input photons. Next, Sec. III illustrates the photon number-dependent performance of the circuit for supervised classification problems. Sec. IV concludes the paper.

II. PARAMETRIZED LINEAR QUANTUM PHOTONIC CIRCUIT MODEL
To demonstrate the Fock state-enhanced expressive power of linear quantum photonic circuits, in this Section we consider the encoding of univariate functions in the circuit's output. For simplicity, we consider the circuit architecture illustrated schematically in Fig. 1, consisting of a single data-dependent encoding layer S sandwiched between two trainable beam splitter meshes W^(1,2), described by the unitary transformation

U(x, Θ) = W^(2)(θ_2) S(x) W^(1)(θ_1),    (1)

where Θ = (θ_1, θ_2) parameterizes the transformations applied by the trainable beam splitter meshes and x is the input data. The n-photon quantum model (the circuit's output) is defined as the expectation value of some observable M(λ) with respect to a state prepared via the parameterized linear QPC,

|ψ_n(x, Θ)⟩ = 𝒰(x, Θ) |n^(i)⟩,    (2)

f^(n)(x, Θ, λ) = ⟨ψ_n(x, Θ)| M(λ) |ψ_n(x, Θ)⟩,    (3)

where λ parameterizes the observable. We consider measurements made using either photon-number-resolving (PNR) detectors or single-photon (threshold) detectors, corresponding to M being diagonal in the Fock state basis with d or d̃ distinct parameterized eigenvalues {λ_j ∈ λ}, respectively. The multi-mode Fock state unitary transformation 𝒰(x, Θ) is constructed from permanents of submatrices of the m-mode linear transformation matrix U(x, Θ) = W^(2)(θ_2)S(x)W^(1)(θ_1), using the scheme of Ref. [50], with W^(i) as the programmable transfer matrix describing a universal multiport interferometer that realizes arbitrary linear optical input-output transformations [51-53]. Each trainable unitary W^(i)(θ_i) is parameterized by a vector θ_i of m(m − 1) phase shifter and beam splitter angles, arranged according to the encoding of Reck et al. [51]. The data encoding block S(x) employs a single tunable phase shifter placed on the first spatial mode.
A. n-photon quantum models as Fourier series

In this Section we show how to express the n-photon quantum models as a Fourier series. For simplicity, we consider arbitrary unitary operations W(θ) = W, an arbitrary parameterized observable obtained using PNR detectors, M(λ) = M, and |n^(i)⟩ = |n, 0, …, 0⟩ as the initial Fock state. The component of the output quantum state with photon numbers n^(f) can be written as [24]

⟨n^(f)| 𝒰(x, Θ) |n^(i)⟩ = Σ_n W^(2)_{n^(f),n} e^{i n_1 x} W^(1)_{n,n^(i)},    (4)

where W^(i)_{n,n'} denote Fock-space matrix elements of the trainable blocks and the summation runs over the basis of d = C(n + m − 1, n) Fock states |n⟩ = |n_1, …, n_m⟩, corresponding to the different combinations of n photons in the m spatial modes. The data encoding block imposes a phase shift proportional to the number of photons n_1 in the first mode following the first beam splitter mesh.
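The lifting of the m-mode transfer matrix to the Fock space via matrix permanents can be sketched directly. The following is an illustrative implementation (assuming numpy; the brute-force permanent is adequate only for small photon numbers, not an efficient algorithm), verifying that the lifted matrix remains unitary and that the encoding phase shifter acts as e^{i n_1 x} on each Fock basis state:

```python
import itertools
import math
import numpy as np

def permanent(A):
    """Brute-force matrix permanent; adequate for the small cases here."""
    n = A.shape[0]
    if n == 0:
        return 1.0 + 0j
    return sum(math.prod(A[i, p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))

def fock_basis(n, m):
    """All placements of n photons into m modes, e.g. (1, 0, 1) for m = 3."""
    return [c for c in itertools.product(range(n + 1), repeat=m)
            if sum(c) == n]

def lift_to_fock(U, n):
    """n-photon Fock-space matrix of an m-mode linear transformation U,
    built from permanents of row/column-repeated submatrices of U."""
    m = U.shape[0]
    basis = fock_basis(n, m)
    d = len(basis)
    Un = np.zeros((d, d), dtype=complex)
    for a, f in enumerate(basis):
        rows = [j for j in range(m) for _ in range(f[j])]
        for b, i in enumerate(basis):
            cols = [k for k in range(m) for _ in range(i[k])]
            norm = math.sqrt(
                math.prod(math.factorial(x) for x in f)
                * math.prod(math.factorial(x) for x in i))
            Un[a, b] = permanent(U[np.ix_(rows, cols)]) / norm
    return Un, basis

# Two photons in three modes: d = C(n+m-1, n) = C(4, 2) = 6.
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3)))
Un, basis = lift_to_fock(W, n=2)
assert len(basis) == math.comb(4, 2)
assert np.allclose(Un.conj().T @ Un, np.eye(len(basis)))  # still unitary

# The encoding phase shifter S(x) = diag(e^{ix}, 1, 1) lifts to a diagonal
# matrix that multiplies each Fock basis state by e^{i n_1 x}.
x = 0.7
S = np.diag([np.exp(1j * x), 1, 1]).astype(complex)
Sn, _ = lift_to_fock(S, n=2)
expected = np.diag([np.exp(1j * f[0] * x) for f in basis])
assert np.allclose(Sn, expected)
```

The final check makes the origin of the Fourier structure explicit: the only x-dependence of the lifted unitary is the set of phases e^{i n_1 x}, n_1 ∈ [0, n].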
The output of the full model, Eq. (3), is obtained by taking the modulus square of Eq. (4), multiplying by the corresponding observable weight, and then summing over all output Fock basis states. This yields an expression of the form

f^(n)(x, Θ, λ) = Σ_{n',n} a_{n',n} e^{i(n_1 − n'_1)x},    (5)

where the a_{n',n} contain the matrix elements of the unitaries W^(i) and of the measurement observable M,

a_{n',n} = Σ_{n^(f)} λ_{n^(f)} [W^(2)_{n^(f),n'}]* W^(2)_{n^(f),n} [W^(1)_{n',n^(i)}]* W^(1)_{n,n^(i)}.    (6)

This expression can be simplified by grouping the basis functions with the same frequency ω = n_1 − n'_1. This gives

f^(n)(x, Θ, λ) = Σ_{ω ∈ Ω_n} c_ω e^{iωx},    (7)

where the coefficients c_ω are obtained by summing over all a_{n',n} contributing to the same frequency, c_ω = Σ_{n',n : n_1 − n'_1 = ω} a_{n',n}, with c_ω = c*_{−ω}, so that Eq. (7) is a real-valued function, as it should be. The frequency spectrum Ω_n = {n_1 − n'_1 | n_1, n'_1 ∈ [0, n]} contains all frequencies that are accessible to the n-photon quantum model, for general trainable circuit blocks W^(i)(θ_i), measurement observables M(λ), and n-photon input Fock states |n^(i)⟩.

B. Expressive power and trainability of linear quantum photonic circuits

Since the n-photon quantum model can be represented by a Fourier series, its expressive power can be studied via two properties: its frequency spectrum and its Fourier coefficients. The former tells us which functions the quantum model has access to, while the latter determines how the accessible functions can be combined [24].

[FIG. 2. (a) Parameterized circuit comprising three spatial modes for fitting of Fourier series and binary classification. One encoding phase shifter is used per classical data feature. (b,c) Two spatial mode circuits for implementing kernel-based machine learning using Gaussian kernels with photon-number-resolving detectors. Here H denotes a 50-50 beam splitter, with matrix elements the same as the Hadamard transform [54,55]; in other words, (b) and (c) are Mach-Zehnder interferometers. Direct implementation of the kernel method uses the phase shifter to encode the squared distance between pairs of data points, φ = δ = (x − x')², while the random kitchen sink approach approximates a Gaussian kernel using a set of randomized input features φ = x_{r,i} = γ(w_r · x_i + b_r).]

1. Photon-number dependent frequency spectrum
The frequency spectrum can easily be shown to be Ω_n = {−n, −n + 1, …, n − 1, n}, which is solely determined by the number of photons fed into the circuit. It always contains the zero frequency, i.e. 0 ∈ Ω_n, while the non-zero frequencies occur in pairs ω, −ω ∈ Ω_n. This motivates us to define the size of the frequency spectrum as D_n = (|Ω_n| − 1)/2 = n, quantifying the number of independent non-zero frequencies the n-photon quantum model has access to. In comparison to Ref. [24], where the size of the frequency spectrum is determined by the spectrum of the data-encoding Hamiltonian, here the size of the frequency spectrum can be increased by feeding more photons into the circuit while keeping the number of spatial modes and encoding phase shifters constant.
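This photon-number dependence can be checked numerically. The sketch below (an illustration assuming numpy, not the authors' code) simulates the model for the input |n, 0, 0⟩, for which the output amplitudes follow from a multinomial expansion of the first column of U(x), and verifies that the sampled output contains no Fourier components with |ω| > n:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(7)

def random_unitary(m):
    """Random m-mode transfer matrix (QR of a complex Gaussian matrix)."""
    Q, _ = np.linalg.qr(rng.normal(size=(m, m)) + 1j * rng.normal(size=(m, m)))
    return Q

def model_output(x, W1, W2, lam, n):
    """n-photon model f(x) for the input |n, 0, ..., 0>.  The amplitude of
    output Fock state |f> is sqrt(n!/prod f_j!) * prod_j c_j^{f_j}, where
    c is the first column of U(x) = W2 S(x) W1 and S(x) phases mode 1."""
    m = W1.shape[0]
    S = np.diag([np.exp(1j * x)] + [1.0 + 0j] * (m - 1))
    c = (W2 @ S @ W1)[:, 0]
    basis = [f for f in itertools.product(range(n + 1), repeat=m)
             if sum(f) == n]
    probs = [math.factorial(n) / math.prod(math.factorial(k) for k in f)
             * abs(math.prod(cj ** k for cj, k in zip(c, f))) ** 2
             for f in basis]
    return float(np.dot(lam, probs))

m, n = 3, 3
W1, W2 = random_unitary(m), random_unitary(m)
lam = rng.uniform(0, 1, size=math.comb(n + m - 1, n))  # PNR outcome weights

K = 16                                  # samples over one 2*pi period
fs = [model_output(2 * np.pi * k / K, W1, W2, lam, n) for k in range(K)]
coeffs = np.fft.fft(fs) / K             # c_omega at omega = 0..K-1 (mod K)
high = np.abs(coeffs[n + 1:K - n])      # all frequencies with |omega| > n
assert np.max(high) < 1e-10             # spectrum is confined to [-n, n]
```

The truncation of `coeffs` at indices n+1 to K−n−1 covers exactly the frequencies outside Ω_n; their vanishing confirms D_n = n for this circuit.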
This implies that n-photon quantum models with more input photons can be more expressive, because they have access to more basis functions, and hence can learn Fourier series with higher frequencies. In the limit n → ∞, i.e. continuous-variable quantum systems, n-photon quantum models can support the frequency spectrum Ω_∞ = {−∞, …, −1, 0, 1, …, ∞} of a full Fourier series, in agreement with Ref. [24]. For a fixed number of input photons, the frequency spectrum can be broadened further using multiple encoding phase shifters, either in parallel or sequentially [23,24] (see Appendix A for details).
As an example, we consider training the linear QPC with three spatial modes shown in Fig. 2(a) using a regularized squared loss cost function,

C(Θ, λ) = (1/N) Σ_{j=1}^{N} [f(x_j, Θ, λ) − g(x_j)]² + α λ·λ,

constructed from the measurement results and a training set of N desired input/output pairs {(x_j, g(x_j))}, and variationally minimized over Θ and λ to learn the function g(x). Here, f(x, Θ, λ) is the n-photon quantum model in Eq. (9), while λ·λ = Σ_i λ_i² is the sum of squared observable parameters, forming a regularization term with weight α. The regularization term has a two-fold role: it prevents the model from over-fitting, and it ensures that the model prediction is not based on output photon combinations that occur with very low probability. The latter is important for QPC-based machine learning models, because the number of measurements required to obtain all of the required expectation values scales with the number of spatial modes and photons.
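For the observable parameters λ, which enter the model linearly, minimizing this regularized cost at fixed circuit parameters has a closed form. A minimal classical sketch (assuming numpy; generic cos/sin features stand in for the circuit's accessible basis functions, and the target, α and feature set are illustrative choices):

```python
import numpy as np

def cost(pred, target, lam, alpha):
    """Regularized squared loss: mean squared error + alpha * (lam . lam)."""
    return np.mean((pred - target) ** 2) + alpha * np.dot(lam, lam)

xs = np.linspace(-np.pi, np.pi, 200)
g = 0.4 + 0.3 * np.cos(2 * xs) - 0.5 * np.sin(3 * xs)   # toy target function

# Features standing in for the basis functions accessible with n = 3 photons.
F = np.column_stack([np.ones_like(xs)]
                    + [np.cos(k * xs) for k in (1, 2, 3)]
                    + [np.sin(k * xs) for k in (1, 2, 3)])

# Setting the cost gradient with respect to lam to zero gives ridge
# regression: (F^T F + N * alpha * I) lam = F^T g.
alpha = 1e-6
N = len(xs)
lam = np.linalg.solve(F.T @ F + N * alpha * np.eye(F.shape[1]), F.T @ g)
assert cost(F @ lam, g, lam, alpha) < 1e-4
```

With α > 0 the solution trades a small fitting error for bounded weights, which is the noise-robustness role described above.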
We train the three-mode linear QPC using the gradient-free BOBYQA algorithm [57] from the NLopt nonlinear-optimization Python package [56] to fit a degree-three Fourier series g(x) with x ∈ [−3π, 3π], using input states of up to 3 photons. We consider input states for which each spatial mode contains at most one photon, i.e. |n^(i)⟩ = |100⟩, |110⟩, or |111⟩. Fig. 3(a) shows how the number of observable frequency components, and hence the expressive power of the circuit, grows with the number of input photons. A perfect fit is achieved with three photons. In contrast, similar qubit-based architectures cannot fit even a degree-two Fourier series using a single encoding block, requiring either deeper or wider circuits with multiple encoding gates [23,24].

[FIG. 3. (a) Fitting the degree-three Fourier series g(x) = Σ_{n=−3}^{3} c_n e^{−inx} with coefficients c_0 = 0.2, c_1 = 0.69 + 0.52i, c_2 = 0.81 + 0.41i and c_3 = 0.68 + 0.82i, using a three-mode circuit with photon-number-resolving (PNR) detectors and different input Fock states, i.e. |100⟩, |110⟩, and |111⟩. A perfect fit is achieved using at least three input photons, since a sufficient number of non-zero frequencies are then available, i.e. D_n = 3. (b) The number of real degrees of freedom of the three-mode parameterized quantum photonic circuit with PNR detectors (black) and threshold detectors (red). The former is always larger than the minimum number of parameters M_min required to control all of the circuit's Fourier coefficients (blue), and hence arbitrary Fourier coefficients can be realized by this circuit. In contrast, the expressive power with threshold detectors (red) is only enhanced for up to 9 input photons.]

2. Trainability of Fourier coefficients
Even if the parametrized QPC can generate the frequency spectrum required to fit a desired function, this does not necessarily imply that the optimal Fourier coefficients are accessible [24]; the linear circuits we consider cannot perform arbitrary Fock state transformations. However, we do not need to generate arbitrary Fock states; we only require control over one real and D_n complex Fourier coefficients {c_ω}. For n input photons, D_n = n, so this requires at least M_min = 2D_n + 1 = 2n + 1 real degrees of freedom.
Each trainable circuit block has m(m − 1) controllable parameters [51], while the number of controllable degrees of freedom of the parameterized observable depends on the type of detector. For photon-number-resolving (PNR) detectors the total number of degrees of freedom is

M_PNR = 2m(m − 1) + C(n + m − 1, n),    (12)

while threshold detectors have

M_th = 2m(m − 1) + Σ_{k=1}^{min(n,m)} C(m, k)    (13)

degrees of freedom, one observable weight per distinct detector click pattern. For a fixed number of spatial modes and photons, threshold detectors have fewer controllable degrees of freedom than PNR detectors, and hence their expressive power saturates beyond a certain number of input photons. For example, Fig. 3(b) illustrates the expressive power of a circuit with three spatial modes. Using threshold detectors, the expressive power is only enhanced by increasing the number of photons up to nine; beyond this, the number of controllable degrees of freedom is less than M_min. On the other hand, using PNR detectors the circuit may in principle be trained to fit arbitrarily high frequencies by increasing the number of input photons. Of course, in practice the range of frequencies accessible using a single encoding gate will be limited by the sensitivity to losses, which grows exponentially with the photon number.
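The parameter counting above can be tabulated in a few lines. This sketch assumes the counts just stated: 2m(m − 1) angles from the two trainable blocks, one observable weight per n-photon Fock outcome for PNR detectors, and one weight per detector click pattern for threshold detectors (the latter formula is an assumption reconstructed here to reproduce the nine-photon saturation of Fig. 3(b), not a quotation of the paper):

```python
from math import comb

def m_min(n):
    """Minimum real parameters: one real plus n complex Fourier coefficients."""
    return 2 * n + 1

def dof_pnr(n, m):
    """Two trainable blocks plus one weight per n-photon Fock outcome."""
    return 2 * m * (m - 1) + comb(n + m - 1, n)

def dof_threshold(n, m):
    """Two trainable blocks plus one weight per detector click pattern
    (assumed count: subsets of modes that can click with n photons)."""
    return 2 * m * (m - 1) + sum(comb(m, k) for k in range(1, min(n, m) + 1))

m = 3
# PNR detectors: the Fock-space dimension grows quadratically with n for
# m = 3, so the available parameters always exceed the requirement.
assert all(dof_pnr(n, m) > m_min(n) for n in range(1, 200))

# Threshold detectors: the click-pattern count is capped at 2^m - 1 = 7,
# so the parameter budget is overtaken by M_min beyond n = 9.
saturation = max(n for n in range(1, 50) if dof_threshold(n, m) >= m_min(n))
assert saturation == 9
```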

3. Universality of the linear quantum photonic circuit
It is well known that a Fourier series can approximate any square-integrable function g(x) on a finite interval to arbitrary precision [48,49]. Thus, expressing the n-photon quantum model in terms of a Fourier series allows us to demonstrate the universality of the quantum model by studying its ability to realize arbitrary Fourier series. Universality of a Fourier series rests on two ingredients: a sufficiently broad frequency spectrum and arbitrary realizable Fourier coefficients. The analysis in Sec. II B 1 implies that the frequency spectrum Ω_n accessible to n-photon quantum models asymptotically contains any integer frequency if enough input photons are used, satisfying the first criterion for universality.
To realize an arbitrary set of Fourier coefficients, at least M ≥ M_min = 2n + 1 degrees of freedom in the linear QPC are required. Here, we consider a linear QPC with PNR detectors, because the expressive power with threshold detectors saturates beyond some threshold number of photons. One of the unique advantages of photonic systems is the dimension of the Fock space, which grows exponentially with the number of spatial modes and photons. For a linear QPC with a constant number of spatial modes m, the dimension of the Fock space, and hence M_PNR, scales as O(n^{m−1}). On the other hand, the degrees of freedom from the trainable circuit blocks scale as O(m²), which is negligible when n ≫ m. Exploiting this advantage, M_PNR is always larger than M_min, since the size of the frequency spectrum D_n, and hence M_min, scales only linearly with the photon number, i.e. O(n). This is a necessary condition for the n-photon quantum model to realize an arbitrary set of Fourier coefficients, which in the examples we consider also seems to be sufficient. More rigorously, following the arguments in Ref. [24], a universal function approximator may be obtained by generalizing our circuits to arbitrary (entangled) input states and observables, by incorporating nonlinear elements into the circuits [58].
As an example, we consider a linear QPC with 3 spatial modes. In this case, Eq. (12) reduces to

M_PNR = 12 + (n + 1)(n + 2)/2,

which is always larger than M_min = 2n + 1 for n ∈ N, as shown in Fig. 3(b). Hence, the n-photon quantum model with 3 spatial modes and a single encoding phase shifter can act as a universal function approximator. In contrast, qubit-based variational quantum circuits require deep or wide circuits with many encoding gates to ensure a rich frequency spectrum, and arbitrary global unitaries to realize arbitrary sets of Fourier coefficients [24].

Effect of noise on the expressive power of linear quantum photonic circuits
For noiseless linear quantum photonic circuits, we have shown that the expressive power improves with increasing numbers of photons and spatial modes. In this section, we discuss the role of optical losses on the expressive power of linear quantum photonic circuits. For real quantum photonic hardware, the sensitivity to optical loss grows exponentially with the circuit depth and the number of input photons. The typical noise sources are (1) inefficient collection optics, (2) losses in the optical components due to absorption, scattering, or reflections from surfaces, and (3) inefficiency in the detection process due to detectors with imperfect quantum efficiency; all of these can be modelled using beam splitters [59]. Photon loss affects the frequency spectrum: higher-frequency terms cannot be distinguished from lower-frequency ones, reducing the size of the frequency spectrum. We anticipate that these noise sources have a minimal impact on the Fourier coefficients, as the coefficients depend only on the physical components, i.e. the linear optics in the trainable blocks and the detectors. Therefore, the output observables can still be written as Fourier series, just with reduced expressivity. This places a practical limit on the complexity of QML models using this scheme, unless some kind of error correction scheme is included. When losses are low enough, the detectors should have a sufficiently high signal-to-noise ratio that other noise sources can be neglected. Apart from error correction, the regularization term in the cost function should also help to minimize the detrimental influence of noise: it penalizes models with large coefficient weights, so that no particular output state dominates, reducing the sensitivity of the final prediction to noise.
Finally, the photonic circuits considered here are based on a variational approach, and are therefore robust against fabrication variations in beam splitter ratios, phase tunings, etc. Quantitative noise modelling of linear photonic quantum circuits will be a subject for future research.

III. SUPERVISED LEARNING USING LINEAR QUANTUM PHOTONIC CIRCUITS
As an application of the trainable linear QPCs, we now consider different strategies for binary classification. In the first strategy, the linear QPCs are used directly as variational quantum classifiers, classifying data on the high-dimensional Fock space by optimizing a regularized squared loss cost function. In this case, as the circuit becomes more expressive it becomes harder to train. Second, we consider kernel methods as a means of avoiding the costly circuit optimization step. We show how linear circuits can be used to implement Gaussian kernels, either directly or using the random kitchen sinks algorithm, sampling kernels with different resolutions in parallel. Note that we are mainly interested in what kinds of kernel functions can be efficiently implemented using linear QPCs, rather than in providing quantum kernels that might offer quantum advantages, motivated by the ongoing interest in classical photonic circuits for machine learning [1-4].

A. Linear quantum photonic circuit as variational quantum classifiers
We perform binary classification of two-dimensional classical data. Each data dimension is encoded using a single phase shifter, as shown in Fig. 2(a). The mapping of the data into the high-dimensional Fock space is nonlinear, circumventing the need for nonlinear optical elements.
The n-photon supervised classification model for two-dimensional data is defined as

f^(n)(x, Θ, λ) = Σ_{ω ∈ Ω_n} c_ω e^{i ω·x},    (15)

where x = (x_1, x_2) is the two-dimensional data feature, ω = (ω_1, ω_2) is the two-dimensional frequency vector, and Ω_n = {−ω, 0, ω | ω_1, ω_2 ∈ [0, n], ω_1 + ω_2 ≤ n} is the frequency spectrum of the model. This encoding scheme does not generate the full frequency spectrum of a multi-dimensional Fourier series, but it suffices for the examples considered here; see Appendix B for schemes that generate the full frequency spectrum for multi-dimensional Fourier series. The model is trained by minimizing the cost function

C(Θ, λ) = (1/N) Σ_{j=1}^{N} [f^(n)(x_j, Θ, λ) − y_j]² + α λ·λ    (16)

using the BOBYQA algorithm, with the decision boundary defined as

y(x) = sgn[f^(n)(x, Θ_opt, λ_opt)],    (17)

where Θ_opt and λ_opt are the optimized circuit and observable parameters and sgn is the sign function. Thus, the class of a data point is assigned by the sign of the circuit output.
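The decision rule above can be illustrated with a purely classical surrogate: regression on cos/sin features drawn from the truncated spectrum Ω_n, followed by the sign function of Eq. (17). This is only a sketch (assuming numpy; the toy dataset, the π scaling of the features, and the ridge solver standing in for BOBYQA are all illustrative choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "circle" dataset: label +1 inside a disc, -1 outside.
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(np.hypot(X[:, 0], X[:, 1]) < 0.6, 1.0, -1.0)

def features(X, n):
    """cos/sin features over half of {omega : omega_1 + |omega_2| <= n},
    one representative per +/- frequency pair, mimicking Eq. (15)."""
    omegas = [(w1, w2) for w1 in range(n + 1) for w2 in range(-n, n + 1)
              if w1 + abs(w2) <= n and (w1, w2) > (0, 0)]
    cols = [np.ones(len(X))]
    for w in omegas:
        phase = np.pi * (X @ np.array(w, dtype=float))  # scale to one period
        cols += [np.cos(phase), np.sin(phase)]
    return np.column_stack(cols)

def fit_predict(X, y, n, alpha=1e-3):
    F = features(X, n)
    lam = np.linalg.solve(F.T @ F + alpha * np.eye(F.shape[1]), F.T @ y)
    return np.sign(F @ lam)       # decision boundary, cf. Eq. (17)

acc = np.mean(fit_predict(X, y, n=3) == y)
# Comfortably above the ~72% baseline of always predicting the majority class.
assert acc > 0.75
```

The number of features grows with n exactly because |Ω_n| does, which is the mechanism behind the richer decision boundaries discussed next.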
As an example, we trained the linear QPC to classify three different types of datasets from the scikit-learn machine learning library [60]: linear, circle, and moon. Fig. 4 illustrates the trained models. The contour plots show that n-photon supervised classification models with higher photon numbers have more complicated classification boundaries, in agreement with the preceding analysis of the expressive power of quantum models. Since the linear dataset can be separated by a linear decision boundary, a single photon is unsurprisingly sufficient to learn the classification boundary. On the other hand, over-fitting can occur when the model's expressive power is too large, as seen in the degraded performance on the circle dataset for the |221⟩ input state. The classification performance for the more complicated moon dataset improves with the number of input photons. These examples illustrate the impact of a higher expressive power on classification using linear QPCs.

B. Linear quantum photonic circuits as Gaussian kernel samplers
Like standard noisy, large-scale variational circuits, the variational machine learning approach becomes more difficult to train as the dimension of the Fock space increases, likely due to vanishing cost function gradients [61-65], requiring exponentially-growing precision to optimize the circuit parameters in situ [66]. In addition, it is expensive to train the quantum gates (in this case, the tunable beam splitter meshes) in the noisy intermediate-scale quantum (NISQ) era, as it is time-consuming to reprogram quantum circuits [46]. Due to these limitations, it is more efficient to use NISQ devices as sub-routines for machine learning algorithms, e.g. to sample quantities that are useful for classical machine learning models but time-consuming to compute. In particular, variational quantum circuits can be used to approximate kernel functions for classical kernel models such as support vector machines [10,13,16,20,67]. Here, we show how the linear QPCs can be designed to approximate Gaussian kernels with a range of resolutions determined by the number of input photons.

[FIG. 4 (caption fragment): The performance for the circle dataset degrades for larger input photon numbers due to over-fitting, demonstrating that a larger expressive power is not necessarily better. On the other hand, a higher expressive power is necessary in order to accurately classify the moon dataset.]

1. Kernel methods
Kernel methods allow one to apply linear classification algorithms to datasets with nonlinear decision boundaries [68,69]. The idea is to leverage a feature map φ(x) that maps the nonlinear dataset from its original space into a higher-dimensional feature space in which a linear decision boundary can be found, enabling classification via linear regression with suitably-trained weights w. Instead of computing and storing the high-dimensional feature vector φ, the kernel trick [69,70] is employed by introducing a kernel function k(x, x'), which measures the pairwise similarity between data points in the feature space. Formally, the kernel function is defined as the inner product of two feature vectors,

k(x, x') = ⟨φ(x), φ(x')⟩.

According to the representer theorem [71], the decision boundary can then be expressed in terms of the kernel functions as

f(x) = Σ_{i=1}^{N} β_i k(x, x_i),    (19)

converting the optimization problem into a convex problem of finding the parameters β_i. In the case of a regularized squared loss cost function such as Eq. (16), the optimal parameters β_i can be obtained analytically [72,73] as

β = (K + α I)^{-1} y,

where N is the number of training data points, K is the N × N kernel matrix with matrix elements K_ij = k(x_i, x_j), I is the N-dimensional identity matrix, α is the regularization parameter, and y is the N × 1 vector of training data labels. Although there is an example showing rigorous performance guarantees for quantum kernel methods on an artificial dataset [47], it is still unclear whether quantum machine learning models can achieve improved performance over classical machine learning approaches in practical problems by sampling from kernels that are hard to compute classically [13,21,74,75]. Even in the absence of a rigorous quantum advantage, special-purpose electronic and photonic machine learning circuits are being pursued in order to increase the speed and energy-efficiency of well-established classical machine learning models [1-4].
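The closed-form solution above is straightforward to state in code. A minimal classical sketch (assuming numpy; the toy dataset, σ and α are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def gaussian_kernel(A, B, sigma):
    """k(x, x') = exp(-|x - x'|^2 / (2 sigma^2)) for all rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Toy dataset: +1 inside a disc, -1 outside.
X = rng.uniform(-1, 1, size=(150, 2))
y = np.where(np.hypot(X[:, 0], X[:, 1]) < 0.6, 1.0, -1.0)

K = gaussian_kernel(X, X, sigma=0.3)
alpha = 1e-3
beta = np.linalg.solve(K + alpha * np.eye(len(y)), y)  # (K + alpha I)^{-1} y
pred = np.sign(K @ beta)        # f(x) = sum_i beta_i k(x, x_i)
assert np.mean(pred == y) > 0.9
```

Note the quadratic cost in the number of training points: K is N × N, which is the scaling bottleneck the random kitchen sinks approach of Sec. III C addresses.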
Therefore, here we will focus on implementing the widely-used Gaussian kernel with controllable resolution σ. The Gaussian kernel is a universal, infinite-dimensional kernel that can learn any continuous function in a compact space [76].

2. Linear quantum photonic circuits as sub-routines of kernel methods
We approximate the Gaussian kernel using the two-mode QPC shown in Fig. 2(b), where 50-50 beam splitters H are used for both trainable circuit blocks and the squared Euclidean distance between pairs of data points, δ = (x − x')², is encoded using a single phase shifter. The output of this circuit can be written as

f^(n)(δ, λ) = Σ_{i,j} λ_ij |⟨n_i, n_j| 𝒰(δ) |n, 0⟩|²,

where 𝒰(δ) is the Fock-space representation of U(δ) = HS(δ)H and the trainable observable is M(λ) = Σ_{ij} λ_ij |n_i, n_j⟩⟨n_i, n_j| with n_i + n_j = n. As in Sec. III A, the observable is trained to approximate a Gaussian kernel of resolution σ, k_σ(δ) = e^{−δ/(2σ²)}, by minimizing the squared loss between the circuit output and k_σ using the BOBYQA algorithm. This approach has two advantages. First, different kernel resolutions can be accessed from the same photon detection statistics |⟨n_i, n_j| 𝒰(δ) |n, 0⟩|² by taking different linear combinations of the output observables M(λ^(σ)). Second, the training only needs to be performed once; the tunable circuit blocks do not need to be reconfigured if the training dataset changes. We note that the domain of the input data, in this case the squared distances between pairs of data points, must lie within the interval that defines the circuit's Fourier series. This imposes an upper bound on the kernel resolution that the linear QPC has access to. A circuit with higher expressive power, i.e. a higher number of input photons, can more precisely approximate kernels with higher resolution σ, while kernels with lower resolution can already be well-approximated by a circuit with only two input photons. Fig. 5 shows the kernel training results for different desired resolutions and input photon numbers. Once the kernel has been trained, classification can be performed by feeding the measured similarity matrix into a classical machine learning model such as a support vector machine [77].
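The training of the observable weights can be mimicked classically: since the circuit output is a Fourier series in δ with integer frequencies 0, …, n, approximating the Gaussian kernel amounts to fitting a combination of cos(kδ) terms to e^{−δ/(2σ²)}. A sketch (assuming numpy; least squares stands in for the BOBYQA training of λ, and the δ grid and σ are illustrative):

```python
import numpy as np

def fit_kernel_weights(n, sigma, deltas):
    """Least-squares weights lambda_k so that sum_k lambda_k cos(k delta)
    approximates exp(-delta / (2 sigma^2)) on the given grid of squared
    distances.  n plays the role of the input photon number."""
    target = np.exp(-deltas / (2 * sigma ** 2))
    F = np.column_stack([np.cos(k * deltas) for k in range(n + 1)])
    lam, *_ = np.linalg.lstsq(F, target, rcond=None)
    return lam, np.linalg.norm(F @ lam - target)

deltas = np.linspace(0, np.pi, 400)   # squared distances within one period
err = {n: fit_kernel_weights(n, sigma=0.4, deltas=deltas)[1]
       for n in (1, 2, 10)}
# More photons -> more accessible frequencies -> sharper kernels fit better.
assert err[10] < err[2] < err[1]
```

The monotone drop in the residual mirrors Fig. 5: low-resolution (broad) kernels are captured with few photons, while high-resolution kernels need the larger frequency spectrum of higher photon numbers.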

C. Quantum-enhanced random kitchen sinks
One limitation of kernel methods is their poor (quadratic) scaling with the size of the training dataset. To circumvent this issue, the random kitchen sinks (RKS) algorithm was developed, which uses randomly-sampled feature maps to controllably approximate desired kernels and train classical machine learning models more efficiently [78-80]. In particular, sampling from random Fourier features enables approximation of the Gaussian kernel. This motivates us to propose a quantum-enhanced RKS algorithm, in which the sub-routine of the RKS algorithm, i.e. the random feature sampler, is replaced by linear QPCs. The linear QPCs can simultaneously sample random Fourier features of different frequencies, providing a unique advantage compared to qubit-based architectures. Our approach differs from previous proposals for quantum random kitchen sinks by directly constructing the kernel functions using the random Fourier features, instead of performing linear regression with random feature bit strings sampled from variational quantum circuits [81,82].

1. Random kitchen sinks
The randomized R-dimensional feature vectors known as random Fourier features are defined as

z(x) = (1/√R) [z_{w_1}(x), z_{w_2}(x), …, z_{w_R}(x)]ᵀ,    (21)

[FIG. 6. When R = 1 the feature vectors reduce to a cosine-like kernel whose frequency increases with the number of input photons and k. The classification results improve with R because the kernels are better approximated by random Fourier features of higher dimension, with σ = 0.25 and 1/7 (R = 100) being the optimal resolutions for the moon dataset. For a given R, the decision boundaries for higher resolutions are noisier because the corresponding approximated kernel has a narrower peak, thus meaningful predictions cannot be made for points that are far (relative to the kernel resolution) from the training set.]
where each z_{w_r}(x) is a randomized cosine function,

z_{w_r}(x) = √2 cos(γ(w_r · x + b_r)),    (22)

x is the D-dimensional input data, the w_r are D-dimensional random vectors sampled from a spherical Gaussian distribution, and the b_r are random scalars sampled from a uniform distribution: w_r ∼ N_D(0, I), b_r ∼ Uniform(0, 2π).
The random Fourier features approximate the Gaussian kernel [78,79],

z(x) · z(x') ≈ exp(−γ² |x − x'|² / 2) = k(x, x'),    (23)

with γ acting as a hyperparameter that controls the kernel resolution. Note that other commonly-used kernels, including the Laplacian and Cauchy kernels, can be approximated using the RKS algorithm with different sampling distributions [78]. Substituting Eq. (23) into Eq. (19) yields

f(x) ≈ Σ_i β_i z(x_i) · z(x) = c · z(x),    (24)

where c = Σ_i β_i z(x_i). The optimal solution for a supervised learning problem using training data {(x_j, y_j)}_{j=1}^N that minimizes the regularized squared loss cost function [72,73,83] for the linear model in Eq. (24) is

c_opt = (ZᵀZ + α I_R)^{-1} Zᵀ y,

where Z is the N × R matrix whose rows are the feature vectors z(x_j)ᵀ, and I_R is the R-dimensional identity matrix. The RKS algorithm addresses the poor scaling of kernel methods with the number of data points by mapping the data into a randomized low-dimensional feature space, turning Eq. (19) into a linear model on the R-dimensional vectors z(x). The complexity of finding the analytical solution for the coefficients is reduced from O(N³) to O(R³), saving enormous resources when R ≪ N while maintaining model performance comparable to standard classification methods [78,79].
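The kernel approximation of Eq. (23) can be verified numerically. The sketch below (assuming numpy; γ = 1 and the sample sizes are illustrative choices) draws R random features of the form of Eq. (22) and compares z(x) · z(x′) with the exact Gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(5)

def random_features(X, R, gamma):
    """Rows z(x) with entries sqrt(2/R) cos(gamma (w_r . x + b_r)),
    w_r ~ N(0, I), b_r ~ Uniform(0, 2 pi), as in Eqs. (21)-(22)."""
    W = rng.normal(size=(R, X.shape[1]))
    b = rng.uniform(0, 2 * np.pi, size=R)
    return np.sqrt(2.0 / R) * np.cos(gamma * (X @ W.T + b))

X = rng.uniform(-1, 1, size=(40, 2))
gamma = 1.0                              # kernel resolution sigma = 1/gamma
Z = random_features(X, R=2000, gamma=gamma)
K_approx = Z @ Z.T                       # z(x) . z(x') for all pairs
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-gamma ** 2 * d2 / 2)   # right-hand side of Eq. (23)
assert np.mean(np.abs(K_approx - K_exact)) < 0.05
```

The entrywise error shrinks as O(1/√R), which is why the decision boundaries in Fig. 6 become cleaner as R grows.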

Linear quantum photonic circuits as random Fourier feature samplers
Using the same circuit as in Sec. III B 2 with a randomized input encoding [Fig. 2(b)], i.e. x_{r,i} = γ(w_r · x_i + b_r), the circuit output becomes

z^(k)_{w_r}(x_i) = √2 cos(kγ(w_r · x_i + b_r)), (28)

which has the same structure as Eq. (22) with γ → kγ. Constructing different observables M(λ^(k)) from the same photon detection statistics |⟨n_i, n_j| U(x_{r,i}) |n, 0⟩|² allows one to isolate cosine functions with different frequencies k. Thus, constructing the random Fourier features of Eq. (21) from the randomized cosine functions of Eq. (28) enables us to approximate the kernel with resolution σ = 1/(kγ). In other words, Gaussian kernels with different resolutions can be accessed using a single QPC and the same set of measurements by considering different observables. The number of kernel resolutions accessible by the circuit equals the size of the frequency spectrum, i.e. circuits with more input photons have access to more resolutions. Here, the photon-number dependent expressive power of the linear QPCs is leveraged to produce a linear combination of cosine functions with different frequencies, simultaneously producing multiple random Fourier features that approximate Gaussian kernels of different resolutions. Fig. 6 illustrates the performance of moon dataset classifiers using circuits with 10 input photons and random Fourier features of different dimensions, i.e. R = 1, 10, 100, with the same decision boundary as in Eq. (17). The circuit with 10 input photons can probe a range of kernel resolutions spanning one order of magnitude, e.g. for γ = 1 the accessible resolutions are σ = {1/n | 1 ≤ n ≤ 10}; six of these are shown in Fig. 6 to illustrate the working principle of quantum-enhanced RKS. The decision boundary for smaller σ is considerably noisier than for larger σ, because a kernel with smaller resolution has a narrower peak, and hence predictions cannot be made far away from the training data points.
Random Fourier features of higher dimensionality provide a better approximation to the kernel, suppressing the noise around the training data points and improving the classification accuracy. The optimal resolutions for the moon dataset are σ = 0.25 and σ = 1/7 (R = 100).
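To illustrate the multi-resolution idea, the following numpy sketch emulates classically the randomized cosine features of Eq. (28) for several values of k and checks that each set approximates a Gaussian kernel of resolution σ = 1/(kγ). The circuit is replaced by its ideal expectation values, and all names and parameter choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# For each random setting x_r = gamma*(w_r . x + b_r), the observables
# M(lambda^(k)) isolate cos(k * x_r) from the same photon statistics, cf. Eq. (28).
n_photons, D, R, gamma = 10, 2, 200, 1.0
W = rng.normal(size=(R, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=R)

def features_at_resolution(X, k):
    # Same structure as Eq. (22) with gamma -> k*gamma, i.e. sigma = 1/(k*gamma)
    return np.sqrt(2.0 / R) * np.cos(k * gamma * (X @ W.T + b))

X = rng.normal(size=(50, D))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

errors = {}
for k in (1, 4, 10):  # a subset of the accessible k = 1, ..., n_photons
    Z = features_at_resolution(X, k)
    K_target = np.exp(-(k * gamma) ** 2 * d2 / 2.0)  # Gaussian kernel, sigma = 1/(k*gamma)
    errors[k] = np.abs(Z @ Z.T - K_target).mean()
    print(k, errors[k])
```

The key point mirrored here is that the same random settings (W, b) serve all resolutions at once; only the post-processing observable, i.e. the multiplier k inside the cosine, changes.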

D. Resource requirements for each scheme
Each of the classification methods has different strengths and limitations in terms of resource requirements, i.e. the number of distinct circuit evaluations needed for training and prediction, summarized in Tab. I. Here we are concerned only with the number of distinct circuit evaluations required to perform training and prediction, since quantum resources are much more precious than classical resources in the NISQ era. For the variational circuit, the data features are directly encoded, but the beam splitters and phase shifters in the trainable circuit blocks must be optimized during training; hence, circuit evaluations are required at every optimization step. Kernel methods, on the other hand, encode the differences between data inputs using one phase shifter, and the training is outsourced to a classical computer, so the training resources scale only with the number of training data points, i.e. as O(N²). The Gaussian kernel with different resolutions can be accessed with a fixed circuit by considering different observables; Gaussian kernels with higher resolutions are better approximated by circuits with larger numbers of input photons. In contrast to the variational method, N different phase-shifter settings are required to make predictions on each new data point. Random kitchen sinks share the advantages of kernel methods, i.e. a fixed circuit with different resolutions accessible via different observables, but have a better training scaling of O(NR), where R is the number of random features chosen. Predictions require R circuit settings regardless of the dimension of the data features.
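As rough bookkeeping for the scalings just described (not the exact entries of Tab. I, whose constant factors and observable counts we omit), the distinct circuit settings can be counted as follows; the function names and counting conventions are our own:

```python
# Back-of-envelope counts of distinct circuit evaluations (illustrative only).

def kernel_method_evaluations(N, N_test):
    # One phase-shifter setting per pair of inputs:
    train = N * (N - 1) // 2   # kernel matrix entries K(x_i, x_j), i < j
    predict = N * N_test       # K(x_i, x_new) for every new point
    return train, predict

def random_kitchen_sinks_evaluations(N, N_test, R):
    # One circuit setting per (data point, random feature) pair:
    train = N * R
    predict = N_test * R
    return train, predict

N, N_test, R = 1000, 100, 50
print(kernel_method_evaluations(N, N_test))            # O(N^2) training
print(random_kitchen_sinks_evaluations(N, N_test, R))  # O(NR) training
```

For R ≪ N, the random-kitchen-sinks counts are smaller in both training and prediction, which is the regime where the scheme pays off.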

IV. CONCLUSION
The data-embedding process is a bottleneck which must be addressed [46] in order to fully leverage the potential of quantum machine learning algorithms. In this paper, we addressed the data-encoding problem by proposing a more gate-efficient bosonic encoding method. Our method has three potential advantages. First, it allows for a more efficient data encoding by modulating all Fock basis states simultaneously using only one phase shifter, regardless of the input photon number. Second, the circuits employ a kernel-like trick, where the nonlinearity is outsourced to quantum feature maps, i.e. the data-encoding phase shifter that encodes the classical data into the high-dimensional Fock space [13,20], avoiding the need for experimentally hard-to-implement nonlinear optical components. Consequently, the expressive power of the circuit can be controlled by the number of input photons, while requiring fewer encoding layers compared to qubit-based architectures [23,24]. Finally, the circuits can be trained to implement commonly-used kernels with well-understood properties, such as the Gaussian kernel.
Even though our photonic models are inspired by BosonSampling circuits [84], we do not expect the arguments for BosonSampling's classical non-simulability to hold for our circuits, for three reasons: (1) the model output is expectation values, not samples; (2) our photonic circuits are not sampled from the Haar random distribution; (3) the assumption m = n² is relaxed, where m is the number of optical modes and n is the number of input photons. Even so, there are other benefits to studying this class of circuits as quantum machine learning models. Quantum machine learning is still in its infancy, and it remains unclear how to rigorously define a quantum advantage for generic machine learning problems [85]. In this work, we focused on a specific problem in this field, the data-encoding problem, showing with simple quantum machine learning models how bosonic circuits may enable more efficient data uploading. We expect our conclusions to remain valid for other classes of quantum machine learning models which may be hard to simulate classically. In addition, we believe our photonic models will serve as primitive quantum machine learning models [85] that inspire researchers in the field to develop other photonic quantum machine learning algorithms possessing quantum advantages. Recently, awareness of the importance of the energy aspect of quantum algorithms has been growing [86]. Although we did not study the energy consumption of our quantum circuits, our models could inspire an application-oriented framework for comparing the energy consumption of quantum machine learning on different platforms.
While the dimension of the Fock space grows (exponentially) with the photon number and number of spatial modes, improving the expressive power, this is accompanied by a higher sensitivity to optical losses and the need for more detection events to accurately sample all of the required output observables. Moreover, there exists a trade-off between the model's expressive power and its ability to generalize: circuits with higher expressive power can suffer from larger generalization errors, i.e. over-fitting [87,88], and trainability issues [33]. One potential way to mitigate these issues is to define the quantum machine learning models in projected Fock spaces, which may lead to potential quantum advantages [74].
We proposed three methods with different resource requirements for performing binary classification using linear quantum photonic circuits (QPCs): (1) variational quantum classifiers that classify data points directly in the high-dimensional Fock space, while (2) and (3) implement Gaussian kernels for classical kernel machines, either directly or via the random kitchen sinks algorithm, sampling kernels with different resolutions in parallel. The random kitchen sink approach could be further improved by sampling the random features from a data-optimized distribution using fault-tolerant quantum computers [89,90]. A linear QPC with three spatial modes and up to 10 input photons, equipped with photon-number resolving detectors, is sufficient for a proof-of-concept experiment for all of the proposed approaches. Therefore, our proposed architecture can be implemented with current technology, such as the integrated photonic circuits [91][92][93] or bulk optics [94] used for BosonSampling experiments [95]. Other experimental aspects, such as the impact of the degree of multi-photon distinguishability and of exponentially-scaling photon losses on the expressive power, are subjects for future research.
While this article investigated the expressive power of linear QPCs, their trainability and generalization power remain open questions. Apart from the gradient-free method used in this article for QML model training, gradient-based methods with analytical gradients [96][97][98] can potentially boost the training speed. However, current analytical gradient evaluation methods apply only to photonic circuits with a 1-photon input Fock state [99] or to continuous-variable quantum photonic systems [100][101][102][103]. Hence, more research is needed to find analytical gradients for quantum photonic circuits with general input Fock states, which requires differentiating the permanents of the transfer matrix. It would also be interesting to study the effect of different input states, e.g. coherent states and squeezed states, on the expressive power, trainability, and generalization power of linear QPCs [104], and to further explore the translation of ideas between classical and quantum photonic circuits for machine learning.

ACKNOWLEDGMENTS
This research was supported by the National Research Foundation, Prime Minister's Office, Singapore, the Ministry of Education, Singapore under the Research Centres of Excellence programme, and the Polisimulator project co-financed by Greece and the EU Regional Development Fund.
Appendix A: General encoding scheme for 1D frequency spectrum

In Sec. II B 1, we derived the frequency spectrum for linear QPCs with a single data-encoding block consisting of only one data-encoding phase shifter. In this section, we broaden the frequency spectrum by adding more data-encoding phase shifters in the same layer (series encoding) or in different layers (parallel encoding). In the series encoding scheme [Fig. 7(a)], we consider linear QPCs with a single data-encoding block consisting of m − 1 phase shifters, i.e. the highest possible number of phase shifters that can be placed within a data-encoding block. In the parallel encoding scheme [Fig. 7(b)], the m − 1 phase shifters are distributed equally among m − 1 data-encoding blocks. One could consider different combinations of phase shifters in each layer, and the expressive power will change accordingly.
As shown in Sec. II B 1, the size of the frequency spectrum of an m-mode linear QPC with one data-encoding phase shifter is D_(n,1,1) = n, where D_(n,L,q) (two additional subscripts are added for clarity) denotes the size of the frequency spectrum realizable by a linear QPC with n input photons and L data-encoding blocks, each consisting of q data-encoding phase shifters. For series encoding (L = 1), we can place one data-encoding phase shifter per mode on the first m − 1 modes, each encoding a phase proportional to its mode number [Fig. 7(a)], i.e. a phase shift ix, where i denotes the mode number. The range of phases that can be picked up by n photons is [0, (m − 1)n], where the lower (upper) bound is obtained when all photons pass through the last (second-to-last) mode. Hence, the size of the frequency spectrum is D_(n,1,m−1) = (m − 1)n. An identical range of phases applies to the parallel encoding scheme, where the lower (upper) bound is achieved when none (all) of the photons pass through the first mode of each layer; thus, D_(n,m−1,1) = (m − 1)n.
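The counting argument above can be checked by brute force. The following sketch (our own notation) enumerates all n-photon mode configurations for the series encoding scheme and counts the distinct positive frequencies:

```python
from itertools import combinations_with_replacement

def spectrum_size(n_photons, m_modes):
    # Series encoding with one phase shifter per mode on the first m-1 modes:
    # a photon in mode i (1 <= i <= m-1) picks up phase i*x; mode m picks up 0.
    # The total frequency of an n-photon configuration is the sum of its
    # photons' mode weights; D counts the distinct positive frequencies.
    weights = list(range(1, m_modes)) + [0]
    freqs = {sum(c) for c in combinations_with_replacement(weights, n_photons)}
    freqs.discard(0)  # count only positive frequencies
    return len(freqs)

# Brute-force check of D_(n,1,m-1) = (m-1)n for a few (n, m)
for n, m in [(1, 3), (2, 3), (3, 4), (5, 5)]:
    print(n, m, spectrum_size(n, m), (m - 1) * n)
```

Because mode 1 carries unit weight, every integer frequency from 0 up to the maximum (m − 1)n is reachable, which is why the spectrum has no gaps.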
Appendix B: Encoding scheme to generate the full frequency spectrum for multi-dimensional Fourier series

In this section, we introduce series and parallel encoding schemes that can generate the full frequency spectrum of a multi-dimensional Fourier series. In the series encoding scheme [Fig. 7(c)], one needs 2^d − 1 phase shifters to encode the positive phases of a d-dimensional degree-1 Fourier series, i.e. {Σ_{i=1}^d r_i x_i | r_1, r_2, ..., r_d ∈ {0, 1}} \ {0}. For example, one can use 7 phase shifters to encode {x_1, x_2, x_3, x_1+x_2, x_1+x_3, x_2+x_3, x_1+x_2+x_3}. The frequency spectrum of a d-dimensional degree-n Fourier series, i.e. Ω = {(k_1, ..., k_d) | k_i ∈ {0, ±1, ..., ±n}}, can then be generated using n input photons. Alternatively, the same frequency spectrum can be generated using d data-encoding blocks, each consisting of one data-encoding phase shifter that encodes one data feature [Fig. 7(d)].
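For concreteness, the 2^d − 1 series-encoding phase combinations are exactly the nonzero subset sums of the data features. This small sketch (illustrative names only) reproduces the d = 3 example above:

```python
from itertools import combinations

def series_encodings(d):
    # All nonzero {0,1}-combinations of the d features: sum_i r_i * x_i, r_i in {0,1}
    features = [f"x{i}" for i in range(1, d + 1)]
    combos = []
    for size in range(1, d + 1):
        for subset in combinations(features, size):
            combos.append("+".join(subset))
    return combos

enc = series_encodings(3)
print(enc)       # the 7 phase-shifter encodings for d = 3
print(len(enc))  # 2^3 - 1 = 7
```

Each returned string corresponds to one phase shifter; distributing the n input photons among these shifters then builds up the higher-degree frequencies.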