We introduce a new approach to quantum linear algebra based on quantum subspace states and present three new quantum machine learning algorithms. The first is a quantum determinant sampling algorithm that samples from the distribution $\Pr[S]= \det(X_{S}X_{S}^{T})$ for $|S|=d$ using $O(nd)$ gates and with circuit depth $O(d\log n)$. The state-of-the-art classical algorithm for this task requires $O(d^{3})$ operations \cite{derezinski2019minimax}. The second is a quantum singular value estimation algorithm for compound matrices $\mathcal{A}^{k}$; the speedup for this algorithm is potentially exponential. It decomposes a $\binom{n}{k}$-dimensional vector of order-$k$ correlations into a linear combination of subspace states corresponding to $k$-tuples of singular vectors of $A$. The third algorithm reduces the depth of circuits used in quantum topological data analysis exponentially, from $O(n)$ to $O(\log n)$. Our basic tool is the quantum subspace state, defined as $|Col(X)\rangle = \sum_{S\subset [n], |S|=d} \det(X_{S}) |S\rangle$ for a matrix $X \in \mathbb{R}^{n \times d}$ with $X^{T} X = I_{d}$, which encodes a $d$-dimensional subspace of $\mathbb{R}^{n}$. We develop two efficient state preparation techniques: the first uses Givens circuits and the representation of a subspace as a sequence of Givens rotations, while the second uses efficient $O(\log n)$-depth implementations of the unitaries $\Gamma(x) = \sum_{i} x_{i} Z^{\otimes (i-1)} \otimes X \otimes I^{\otimes(n-i)}$, which we term Clifford loaders.
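For small instances, the determinant distribution above can be checked classically; the following numpy sketch (a sanity check, not the quantum algorithm) confirms that the squared amplitudes $\det(X_S)^2$ of a subspace state form a probability distribution, via the Cauchy-Binet identity.

```python
# Classical sanity check (not the quantum circuit): Pr[S] = det(X_S X_S^T)
# defines a probability distribution over d-subsets when X^T X = I_d, and
# equals the squared amplitude det(X_S)^2 of the subspace state |Col(X)>.
import itertools
import numpy as np

n, d = 6, 3
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((n, d)))  # orthonormal columns: X^T X = I_d

probs = {}
for S in itertools.combinations(range(n), d):
    X_S = X[list(S), :]                 # d x d submatrix on the rows in S
    probs[S] = np.linalg.det(X_S) ** 2  # = det(X_S X_S^T) for square X_S

assert abs(sum(probs.values()) - 1.0) < 1e-12  # Cauchy-Binet: amplitudes normalize
print(max(probs, key=probs.get))               # most likely subset S
```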
Many existing algorithms for streaming geometric data analysis are plagued by exponential dependencies in the space complexity, which are undesirable for processing high-dimensional data sets. In particular, once $d\geq\log n$, there are no known non-trivial streaming algorithms for problems such as maintaining convex hulls and L\"owner-John ellipsoids of $n$ points, despite a long line of work in streaming computational geometry since [AHV04]. We simultaneously improve all of these results to $\mathrm{poly}(d,\log n)$ bits of space, by trading off with a $\mathrm{poly}(d,\log n)$ factor distortion. We achieve these results in a unified manner, by designing the first streaming algorithm for maintaining a coreset for $\ell_\infty$ subspace embeddings with $\mathrm{poly}(d,\log n)$ space and $\mathrm{poly}(d,\log n)$ distortion. Our algorithm also gives similar guarantees in the \emph{online coreset} model. Along the way, we sharpen results for online numerical linear algebra by replacing a log condition number dependence with a $\log n$ dependence, answering a question of [BDM+20]. Our techniques provide a novel connection between leverage scores, a fundamental object in numerical linear algebra, and computational geometry. For $\ell_p$ subspace embeddings, we give nearly optimal trade-offs between space and distortion for one-pass streaming algorithms. For instance, we give a deterministic coreset using $O(d^2\log n)$ space and $O((d\log n)^{1/2-1/p})$ distortion for $p>2$, whereas previous deterministic algorithms incurred a $\mathrm{poly}(n)$ factor in the space or the distortion [CDW18]. Our techniques have implications in the offline setting, where we give optimal trade-offs between the space complexity and distortion of subspace sketch data structures. To do this, we give an elementary proof of a "change of density" theorem of [LT80] and make it algorithmic.
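As a pointer to the leverage-score connection mentioned above, the following numpy sketch computes statistical leverage scores; the geometric interpretation in the final comment is an informal gloss, not the paper's formal statement.

```python
# Sketch of statistical leverage scores (the numerical linear algebra object
# referenced above): for A in R^{n x d}, tau_i = ||e_i^T U||^2, where U is an
# orthonormal basis of the column span of A.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 10))
U, _, _ = np.linalg.svd(A, full_matrices=False)   # U: orthonormal basis, 1000 x 10
scores = np.sum(U**2, axis=1)                     # leverage scores tau_i in [0, 1]
assert abs(scores.sum() - 10) < 1e-8              # they sum to rank(A) = d
# Informally, rows with large tau_i are directionally important and are natural
# candidates to keep in a coreset preserving subspace-embedding quantities.
```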
Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions optimized on linearly separable data. Previous works assumed either a vanishing learning rate, iterate averaging, or loss assumptions that do not hold for the monotone loss functions used for classification, such as the logistic loss. We prove our result for a fixed dataset, both for sampling with and without replacement. Furthermore, for the logistic loss (and similar exponentially-tailed losses), we prove that with SGD the weight vector converges in direction to the $L_2$ max-margin vector as $O(1/\log(t))$ for almost all separable datasets, and that the loss converges as $O(1/t)$, similarly to gradient descent. Lastly, we examine the case of a fixed learning rate proportional to the minibatch size. We prove that in this case, the asymptotic convergence rate of SGD (with replacement) does not depend on the minibatch size in terms of epochs, if the support vectors span the data. These results may suggest an explanation for similar behaviors observed in deep networks trained with SGD.
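The zero-loss and max-margin claims are easy to observe empirically; the following sketch (an illustration on a synthetic separable dataset, not part of the proof) runs fixed-step-size SGD with replacement on the logistic loss of a homogeneous linear classifier.

```python
# Illustration on synthetic separable data: fixed step-size SGD with
# replacement on the logistic loss; the loss decays toward zero and the
# direction w/||w|| stabilizes, consistent with the stated results.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)
n, d = 100, 2
w_star = np.array([1.0, 0.5])
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)                    # linearly separable by construction

w = np.zeros(d)
eta = 0.5                                  # fixed, non-vanishing learning rate
for t in range(100000):
    i = rng.integers(n)                    # sample one example with replacement
    m = y[i] * (X[i] @ w)                  # margin of the sampled example
    w += eta * y[i] * X[i] * expit(-m)     # SGD step on log(1 + exp(-m))

loss = np.mean(np.logaddexp(0.0, -y * (X @ w)))   # average logistic loss
print(f"loss = {loss:.2e}, direction = {w / np.linalg.norm(w)}")
```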
SVD (singular value decomposition) is one of the basic tools of machine learning, allowing one to optimize a basis for a given matrix. However, sometimes we have a set of matrices $\{A_k\}_k$ instead and would like to optimize a single common basis for them: find orthogonal matrices $U$, $V$ such that the set of matrices $\{U^T A_k V\}$ is in some sense simpler. For example, DCT-II is an orthonormal basis of functions commonly used in image/video compression; as discussed here, this kind of basis can be quickly and automatically optimized for a given dataset. While the gradient descent optimization also discussed here can be computationally costly, we propose CSVD (common SVD), a fast general approach based on SVD. Specifically, we choose $U$ as built of the eigenvectors of $\sum_k (w_k)^q (A_k A_k^T)^p$ and $V$ of the eigenvectors of $\sum_k (w_k)^q (A_k^T A_k)^p$, where the $w_k$ are their weights and $p,q>0$ are some chosen powers, e.g. $1/2$, optionally after normalization, e.g. $A \to A - rc^T$ where $r_i=\sum_j A_{ij}$, $c_j =\sum_i A_{ij}$.
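Since the construction is fully explicit, it can be sketched in a few lines of numpy; the snippet below uses $p = q = 1$ and uniform weights for concreteness (fractional powers such as $p = 1/2$ would require a symmetric matrix power via an eigendecomposition), and measures how much energy the common basis concentrates on the diagonals.

```python
# CSVD sketch with p = q = 1: a single common basis (U, V) for a set {A_k},
# taken from eigenvectors of weighted sums of A_k A_k^T and A_k^T A_k.
import numpy as np

rng = np.random.default_rng(0)
As = [rng.standard_normal((8, 6)) for _ in range(5)]
w = np.ones(len(As))                          # weights w_k (uniform here)
q = 1.0                                       # chosen power for the weights

SU = sum(wk**q * (A @ A.T) for wk, A in zip(w, As))   # sum_k w_k^q (A_k A_k^T)^p, p = 1
SV = sum(wk**q * (A.T @ A) for wk, A in zip(w, As))   # sum_k w_k^q (A_k^T A_k)^p
_, U = np.linalg.eigh(SU)                     # orthogonal U: eigenvectors of SU
_, V = np.linalg.eigh(SV)                     # orthogonal V: eigenvectors of SV

B = [U.T @ A @ V for A in As]                 # all A_k expressed in the common basis
conc = [np.sum(np.diag(b)**2) / np.sum(b**2) for b in B]  # energy on the diagonal
print(np.round(conc, 3))
```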
In this work, we study the transfer learning problem under high-dimensional generalized linear models (GLMs), which aims to improve the fit on target data by borrowing information from useful source data. Given the set of sources to transfer from, we propose a transfer learning algorithm for GLMs and derive its $\ell_1/\ell_2$-estimation error bounds as well as a bound for a prediction error measure. The theoretical analysis shows that, when the target and sources are sufficiently close to each other, these bounds improve over those of the classical penalized estimator using only target data, under mild conditions. When the transferable sources are unknown, an algorithm-free transferable source detection approach is introduced to detect informative sources, and its detection consistency is proved in the high-dimensional GLM transfer learning setting. We also propose an algorithm to construct confidence intervals for each coefficient component, and provide the corresponding theory. Extensive simulations and a real-data experiment verify the effectiveness of our algorithms. We implement the proposed GLM transfer learning algorithms in a new R package glmtrans, which is available on CRAN.
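As a rough illustration of the transfer idea in the linear special case, the following hedged sketch pools target and source data for a first lasso fit and then fits a sparse correction on the target alone; this two-step structure is illustrative only and should not be read as the exact glmtrans algorithm.

```python
# Hedged two-step transfer sketch (linear case, scikit-learn), illustrative
# of the idea only: pool target + source for a shared estimate, then fit a
# sparse debiasing correction on the target residuals.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d = 50
beta = np.zeros(d); beta[:5] = 1.0                 # sparse target coefficient
Xt, Xs = rng.standard_normal((100, d)), rng.standard_normal((1000, d))
yt = Xt @ beta + 0.1 * rng.standard_normal(100)
ys = Xs @ (beta + 0.05 * rng.standard_normal(d)) + 0.1 * rng.standard_normal(1000)

# Step 1: pooled estimate from target plus an informative ("close") source
w_pool = Lasso(alpha=0.05).fit(np.vstack([Xt, Xs]), np.concatenate([yt, ys])).coef_
# Step 2: sparse correction fitted on the target residuals only
delta = Lasso(alpha=0.05).fit(Xt, yt - Xt @ w_pool).coef_
beta_hat = w_pool + delta

target_only = Lasso(alpha=0.05).fit(Xt, yt).coef_  # classical penalized baseline
print(np.linalg.norm(beta_hat - beta), np.linalg.norm(target_only - beta))
```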
Let $X^{(n)}$ be an observation sampled from a distribution $P_{\theta}^{(n)}$ with an unknown parameter $\theta,$ $\theta$ being a vector in a Banach space $E$ (most often, a high-dimensional space of dimension $d$). We study the problem of estimation of $f(\theta)$ for a functional $f:E\mapsto {\mathbb R}$ of some smoothness $s>0$ based on an observation $X^{(n)}\sim P_{\theta}^{(n)}.$ Assuming that there exists an estimator $\hat \theta_n=\hat \theta_n(X^{(n)})$ of the parameter $\theta$ such that $\sqrt{n}(\hat \theta_n-\theta)$ is sufficiently close in distribution to a mean zero Gaussian random vector in $E,$ we construct a functional $g:E\mapsto {\mathbb R}$ such that $g(\hat \theta_n)$ is an asymptotically normal estimator of $f(\theta)$ with $\sqrt{n}$ rate provided that $s>\frac{1}{1-\alpha}$ and $d\leq n^{\alpha}$ for some $\alpha\in (0,1).$ We also derive general upper bounds on Orlicz norm error rates for the estimator $g(\hat \theta_n),$ depending on the smoothness $s,$ dimension $d,$ sample size $n,$ and the accuracy of normal approximation of $\sqrt{n}(\hat \theta_n-\theta).$ In particular, this approach yields asymptotically efficient estimators in some high-dimensional exponential models.
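A hedged sketch of one standard route to such a functional $g$ (iterated bias correction, stated here as background; not necessarily the exact construction of this paper): let $(\mathcal{T}f)(\theta) := {\mathbb E}_{\theta} f(\hat \theta_n)$ and $\mathcal{B} := \mathcal{T}-I,$ and set $g := \sum_{j=0}^{k} (-1)^{j} \mathcal{B}^{j} f.$ Since $\mathcal{T} = I + \mathcal{B},$ a telescoping computation gives ${\mathbb E}_{\theta}\, g(\hat \theta_n) - f(\theta) = (-1)^{k} (\mathcal{B}^{k+1} f)(\theta),$ so each extra term reduces the bias by a further factor; this is where the smoothness condition $s>\frac{1}{1-\alpha}$ and the dimension condition $d\leq n^{\alpha}$ enter the $\sqrt{n}$-rate.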
One of the main challenges in solving time-dependent partial differential equations is to develop computationally efficient solvers that are accurate and stable. Here, we introduce a graph neural network approach for learning efficient PDE solvers based on message passing. We first introduce domain-invariant features for PDE data, inspired by classical PDE solvers, to obtain an efficient physical representation. Next, we use graphs to represent PDE data on an unstructured mesh and show that message-passing graph neural networks (MPGNN) can parameterize governing equations and, as a result, efficiently learn accurate solver schemes for linear and nonlinear PDEs. We further show that the learned solvers are independent of the geometry they were trained on, i.e., the trained solver can find PDE solutions on different complex domains. Lastly, we show that a recurrent graph neural network approach can find a temporal sequence of solutions to a PDE.
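The message-passing computation underlying such solvers is straightforward to sketch; the snippet below implements one untrained MPGNN-style update on a random mesh graph in numpy (the feature and weight choices are illustrative, not the paper's trained architecture).

```python
# One MPGNN-style update on a random mesh graph (illustrative weights and
# features): messages are built from both endpoint states and edge geometry,
# then sum-aggregated and used to update the node states.
import numpy as np

rng = np.random.default_rng(0)
num_nodes, h_dim = 50, 8
pos = rng.random((num_nodes, 2))                       # 2-D mesh node coordinates
edges = [(i, j) for i in range(num_nodes) for j in range(num_nodes)
         if i != j and np.linalg.norm(pos[i] - pos[j]) < 0.2]
h = rng.standard_normal((num_nodes, h_dim))            # node states (e.g. u, grad u)
W_msg = 0.1 * rng.standard_normal((2 * h_dim + 2, h_dim))
W_upd = 0.1 * rng.standard_normal((2 * h_dim, h_dim))

msgs = np.zeros_like(h)
for i, j in edges:                                     # m_ij = phi(h_i, h_j, x_i - x_j)
    msgs[i] += np.tanh(np.concatenate([h[i], h[j], pos[i] - pos[j]]) @ W_msg)
h_new = np.tanh(np.concatenate([h, msgs], axis=1) @ W_upd)  # h_i' = psi(h_i, m_i)
```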
This paper is devoted to a practical method for modeling and optimizing ferroalloy consumption. We consider the problem of selecting optimal process control parameters based on the analysis of historical sensor data. We developed an approach that predicts the results of chemical reactions and recommends ferroalloy consumption. The main features of our method are easy interpretability and noise resistance. Our approach is based on the k-means clustering algorithm, decision trees, and linear regression. The main idea of the method is to identify situations in which processes evolve similarly. To this end, we propose a k-means based dataset clustering algorithm and a classification algorithm to determine the cluster of a new observation. The method can also be applied to various other technological processes; in this article, we demonstrate its application in metallurgy. To test the proposed method, we used it to optimize ferroalloy consumption in Basic Oxygen Furnace steelmaking when finishing steel in a ladle furnace. The minimum required element content for a given steel grade was selected as the predictive model's target variable, and the required amount of the element to be added to the melt as the optimized variable. Keywords: Clustering, Machine Learning, Linear Regression, Steelmaking, Optimization, Gradient Boosting, Artificial Intelligence, Decision Trees, Recommendation services
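A minimal version of the described pipeline can be sketched with scikit-learn; the feature semantics and model choices below are illustrative placeholders, not the production system.

```python
# Illustrative pipeline sketch (hypothetical features and scales): k-means
# groups historical heats into similar process situations; a per-cluster
# linear model then predicts the resulting element content.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))   # per-heat sensor features (temperature, O2, ...)
y = X @ rng.standard_normal(6) + 0.1 * rng.standard_normal(500)  # element content

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
models = {c: LinearRegression().fit(X[km.labels_ == c], y[km.labels_ == c])
          for c in range(5)}

x_new = rng.standard_normal((1, 6))
c = int(km.predict(x_new)[0])       # assign the new heat to its cluster
pred = models[c].predict(x_new)[0]  # cluster-local prediction; the ferroalloy
                                    # addition is chosen to reach the minimum
print(f"cluster {c}, predicted content {pred:.3f}")  # required element content
```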
Present-day atomistic simulations generate long trajectories of ever more complex systems. Analyzing these data, discovering metastable states, and uncovering their nature is becoming increasingly challenging. In this paper, we first use the variational approach to conformation dynamics to discover the slowest dynamical modes of the simulations. This allows the different metastable states of the system to be located and organized hierarchically. The physical descriptors that characterize the metastable states are then discovered by means of a machine learning method. We show, in the cases of two proteins, Chignolin and Bovine Pancreatic Trypsin Inhibitor, how such an analysis can be performed effortlessly in a matter of seconds. Another strength of our approach is that it can be applied to the analysis of both unbiased and biased simulations.
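The core computation of the variational approach reduces to a generalized eigenproblem between time-lagged and instantaneous covariance matrices; the sketch below illustrates it on a toy AR(1) trajectory (a stand-in for projected MD features, not the paper's pipeline).

```python
# VAC/TICA-style sketch: the slowest dynamical modes solve the generalized
# eigenproblem C(tau) v = lambda C(0) v for lagged and instantaneous covariances.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
T, k, tau = 10000, 4, 10
x = np.zeros((T, k))
for t in range(1, T):                            # toy AR(1) surrogate for
    x[t] = 0.99 * x[t - 1] + 0.1 * rng.standard_normal(k)  # trajectory features
x -= x.mean(axis=0)

C0 = x[:-tau].T @ x[:-tau] / (T - tau)           # instantaneous covariance C(0)
Ct = x[:-tau].T @ x[tau:] / (T - tau)            # time-lagged covariance C(tau)
Ct = 0.5 * (Ct + Ct.T)                           # symmetrize (assumes reversibility)

evals, evecs = eigh(Ct, C0)                      # C(tau) v = lambda C(0) v
order = np.argsort(evals)[::-1]                  # largest eigenvalue = slowest mode
timescales = -tau / np.log(np.clip(evals[order], 1e-12, 1 - 1e-12))
slow_modes = x @ evecs[:, order[:2]]             # trajectory on the slowest modes
```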
There are many important high-dimensional function classes that have fast agnostic learning algorithms when strong assumptions on the distribution of examples can be made, such as Gaussianity or uniformity over the domain. But how can one be sufficiently confident that the data indeed satisfies the distributional assumption, so that one can trust the output quality of the agnostic learning algorithm? We propose a model by which to systematically study the design of tester-learner pairs $(\mathcal{A},\mathcal{T})$, such that if the distribution on examples in the data passes the tester $\mathcal{T}$ then one can safely trust the output of the agnostic learner $\mathcal{A}$ on the data. To demonstrate the power of the model, we apply it to the classical problem of agnostically learning halfspaces under the standard Gaussian distribution and present a tester-learner pair with a combined run-time of $n^{\tilde{O}(1/\epsilon^4)}$. This qualitatively matches the run-time of the best known ordinary agnostic learning algorithms for this task; in contrast, finite-sample Gaussian distribution testers do not exist for the $L_1$ and EMD distance measures. A key step in the analysis is a novel characterization of the concentration and anti-concentration properties of a distribution whose low-degree moments approximately match those of a Gaussian; we also use tools from polynomial approximation theory. On the negative side, we show strong lower bounds on the combined run-times of tester-learner pairs for the problems of agnostically learning convex sets under the Gaussian distribution and monotone Boolean functions under the uniform distribution over $\{0,1\}^n$. Through these lower bounds we exhibit natural problems where there is a dramatic gap between the standard agnostic learning run-time and the run-time of the best tester-learner pair.
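To make the tester-learner interface concrete, the following sketch implements a naive low-degree moment tester along coordinate directions; it only illustrates the moment-matching property used in the analysis, not the paper's actual tester (which must also handle cross-moments and arbitrary directions).

```python
# Illustrative tester sketch: accept only if low-degree empirical moments along
# coordinate directions approximately match those of N(0, 1).
import numpy as np

def gaussian_moment(k):
    # E[g^k] for g ~ N(0,1): 0 for odd k, double factorial (k-1)!! for even k
    return 0.0 if k % 2 else float(np.prod(np.arange(k - 1, 0, -2)))

def moment_tester(X, degree=4, tol=0.15):
    for k in range(1, degree + 1):
        emp = (X**k).mean(axis=0)                       # empirical k-th moments
        if np.max(np.abs(emp - gaussian_moment(k))) > tol:
            return False      # reject: do not trust the agnostic learner's output
    return True               # accept: run the learner with its guarantee intact

rng = np.random.default_rng(0)
print(moment_tester(rng.standard_normal((100000, 5))))  # True: Gaussian passes
print(moment_tester(rng.uniform(-1, 1, (100000, 5))))   # False: wrong 2nd moment
```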
Minimizing cross-entropy over the softmax scores of a linear map composed with a high-capacity encoder is arguably the most popular choice for training neural networks on supervised learning tasks. However, recent works show that one can directly optimize the encoder instead to obtain equally (or even more) discriminative representations via a supervised variant of a contrastive objective. In this work, we address the question of whether there are fundamental differences in the sought-for representation geometry in the output space of the encoder at minimal loss. Specifically, we prove, under mild assumptions, that both losses attain their minimum once the representations of each class collapse to the vertices of a regular simplex inscribed in a hypersphere. We provide empirical evidence that this configuration is attained in practice and that reaching a close-to-optimal state typically indicates good generalization performance. Yet, the two losses show remarkably different optimization behavior: the number of iterations required to perfectly fit the data scales superlinearly with the number of randomly flipped labels for the supervised contrastive loss, in contrast to the approximately linear scaling previously reported for networks trained with cross-entropy.
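The claimed optimal geometry is easy to verify numerically: the vertices of a regular simplex inscribed in the unit hypersphere are equiangular with pairwise inner products $-1/(K-1)$, as the following check confirms.

```python
# Numerical check of the simplex geometry: K unit vectors forming a regular
# simplex have identical pairwise inner products of -1/(K-1).
import numpy as np

K = 5                                              # number of classes
M = np.eye(K) - np.ones((K, K)) / K                # center the K standard basis vectors
V = M / np.linalg.norm(M, axis=1, keepdims=True)   # K unit vectors (simplex vertices)
G = V @ V.T                                        # Gram matrix of the vertices
off_diag = G[~np.eye(K, dtype=bool)]
assert np.allclose(off_diag, -1.0 / (K - 1))       # equiangular: <v_i, v_j> = -1/(K-1)
print(G.round(3))
```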