Quantum computers promise to enhance machine learning for practical applications. Quantum machine learning for real-world data has to handle extensive amounts of high-dimensional data. However, conventional methods for measuring quantum kernels are impractical for large datasets as they scale with the square of the dataset size. Here, we measure quantum kernels using randomized measurements. The quantum computation time scales linearly with dataset size and quadratic for classical post-processing. While our method scales in general exponentially in qubit number, we gain a substantial speed-up when running on intermediate-sized quantum computers. Further, we efficiently encode high-dimensional data into quantum computers with the number of features scaling linearly with the circuit depth. The encoding is characterized by the quantum Fisher information metric and is related to the radial basis function kernel. Our approach is robust to noise via a cost-free error mitigation scheme. We demonstrate the advantages of our methods for noisy quantum computers by classifying images with the IBM quantum computer. To achieve further speedups we distribute the quantum computational tasks between different quantum computers. Our method enables benchmarking of quantum machine learning algorithms with large datasets on currently available quantum computers.
We propose a class of randomized quantum algorithms for the task of sampling from matrix functions, without the use of quantum block encodings or any other coherent oracle access to the matrix elements. As such, our use of qubits is purely algorithmic, and no additional qubits are required for quantum data structures. For $N\times N$ Hermitian matrices, the space cost is $\log(N)+1$ qubits and depending on the structure of the matrices, the gate complexity can be comparable to state-of-the-art methods that use quantum data structures of up to size $O(N^2)$, when considering equivalent end-to-end problems. Within our framework, we present a quantum linear system solver that allows one to sample properties of the solution vector, as well as an algorithm for sampling properties of ground states of Hamiltonians. As a concrete application, we combine these two sub-routines to present a scheme for calculating Green's functions of quantum many-body systems.
In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors, splitting the sample at hand into a training, validation, and evaluation set, and only compute a single confidence interval for the prediction performance of the final selected model. We however propose an algorithm how to compute valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least comparably good as bounds from standard approaches, and that reliably reach the nominal coverage probability. In addition, especially when sample size is small, our proposed approach yields better performing prediction models than the default selection of only one model for evaluation does.
In the age of big data and interpretable machine learning, approaches need to work at scale and at the same time allow for a clear mathematical understanding of the method's inner workings. While there exist inherently interpretable semi-parametric regression techniques for large-scale applications to account for non-linearity in the data, their model complexity is still often restricted. One of the main limitations are missing interactions in these models, which are not included for the sake of better interpretability, but also due to untenable computational costs. To address this shortcoming, we derive a scalable high-order tensor product spline model using a factorization approach. Our method allows to include all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We prove both theoretically and empirically that our methods scales notably better than existing approaches, derive meaningful penalization schemes and also discuss further theoretical aspects. We finally investigate predictive and estimation performance both with synthetic and real data.
Machine learning algorithms based on parametrized quantum circuits are prime candidates for near-term applications on noisy quantum computers. In this direction, various types of quantum machine learning models have been introduced and studied extensively. Yet, our understanding of how these models compare, both mutually and to classical models, remains limited. In this work, we identify a constructive framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show using tools from quantum information theory how data re-uploading circuits, an apparent outlier of this framework, can be efficiently mapped into the simpler picture of linear models in quantum Hilbert spaces. Furthermore, we analyze the experimentally-relevant resource requirements of these models in terms of qubit number and amount of data needed to learn. Based on recent results from classical machine learning, we prove that linear quantum models must utilize exponentially more qubits than data re-uploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially more data points. Our results provide a more comprehensive view of quantum machine learning models as well as insights on the compatibility of different models with NISQ constraints.
Nowadays, many researchers are focusing their attention on the subject of machine translation (MT). However, Persian machine translation has remained unexplored despite a vast amount of research being conducted in languages with high resources, such as English. Moreover, while a substantial amount of research has been undertaken in statistical machine translation for some datasets in Persian, there is currently no standard baseline for transformer-based text2text models on each corpus. This study collected and analysed the most popular and valuable parallel corpora, which were used for Persian-English translation. Furthermore, we fine-tuned and evaluated two state-of-the-art attention-based seq2seq models on each dataset separately (48 results). We hope this paper will assist researchers in comparing their Persian to English and vice versa machine translation results to a standard baseline.
Classical quickest change detection algorithms require modeling pre-change and post-change distributions. Such an approach may not be feasible for various machine learning models because of the complexity of computing the explicit distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. This paper develops a new variant of the classical Cumulative Sum (CUSUM) algorithm for the quickest change detection. This variant is based on Fisher divergence and the Hyv\"arinen score and is called the Score-based CUSUM (SCUSUM) algorithm. The SCUSUM algorithm allows the applications of change detection for unnormalized statistical models, i.e., models for which the probability density function contains an unknown normalization constant. The asymptotic optimality of the proposed algorithm is investigated by deriving expressions for average detection delay and the mean running time to a false alarm. Numerical results are provided to demonstrate the performance of the proposed algorithm.
Regression models that ignore measurement error in predictors may produce highly biased estimates leading to erroneous inferences. It is well known that it is extremely difficult to take measurement error into account in Gaussian nonparametric regression. This problem becomes tremendously more difficult when considering other families such as logistic regression, Poisson and negative-binomial. For the first time, we present a method aiming to correct for measurement error when estimating regression functions flexibly covering virtually all distributions and link functions regularly considered in generalized linear models. This approach depends on approximating the first and the second moment of the response after integrating out the true unobserved predictors in a semiparametric generalized linear model. Unlike previous methods, this method is not restricted to truncated splines and can utilize various basis functions. Through extensive simulation studies, we study the performance of our method under many scenarios.
Multi-dimensional direct numerical simulation (DNS) of the Schr\"odinger equation is needed for design and analysis of quantum nanostructures that offer numerous applications in biology, medicine, materials, electronic/photonic devices, etc. In large-scale nanostructures, extensive computational effort needed in DNS may become prohibitive due to the high degrees of freedom (DoF). This study employs a reduced-order learning algorithm, enabled by the first principles, for simulation of the Schr\"odinger equation to achieve high accuracy and efficiency. The proposed simulation methodology is applied to investigate two quantum-dot structures; one operates under external electric field, and the other is influenced by internal potential variation with periodic boundary conditions. The former is similar to typical operations of nanoelectronic devices, and the latter is of interest to simulation and design of nanostructures and materials, such as applications of density functional theory. Using the proposed methodology, a very accurate prediction can be realized with a reduction in the DoF by more than 3 orders of magnitude and in the computational time by 2 orders, compared to DNS. The proposed physics-informed learning methodology is also able to offer an accurate prediction beyond the training conditions, including higher external field and larger internal potential in untrained quantum states.
One way of introducing sparsity into deep networks is by attaching an external table of parameters that is sparsely looked up at different layers of the network. By storing the bulk of the parameters in the external table, one can increase the capacity of the model without necessarily increasing the inference time. Two crucial questions in this setting are then: what is the lookup function for accessing the table and how are the contents of the table consumed? Prominent methods for accessing the table include 1) using words/wordpieces token-ids as table indices, 2) LSH hashing the token vector in each layer into a table of buckets, and 3) learnable softmax style routing to a table entry. The ways to consume the contents include adding/concatenating to input representation, and using the contents as expert networks that specialize to different inputs. In this work, we conduct rigorous experimental evaluations of existing ideas and their combinations. We also introduce a new method, alternating updates, that enables access to an increased token dimension without increasing the computation time, and demonstrate its effectiveness in language modeling.
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.