This article presents a new way to study the theory of regularized learning for generalized data in Banach spaces, including representer theorems and convergence theorems. The generalized data consist of linear functionals and real scalars, serving as the input and output elements that represent the discrete information of many engineering and physics models. Extending classical machine learning, the empirical risks are computed from the generalized data and the loss functions. Following the techniques of regularization, the exact solutions are approximated by minimizing the regularized empirical risks over the Banach spaces. The existence and convergence of the approximate solutions are guaranteed by the relative compactness of the generalized input data in the predual spaces of the Banach spaces.
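For orientation, here is a schematic form of the minimization described in this abstract, in notation of our own choosing (the abstract itself does not fix symbols): with generalized input data given by linear functionals $\nu_1,\dots,\nu_m$ in the predual space of the Banach space $\mathcal{B}$, output scalars $y_1,\dots,y_m\in\mathbb{R}$, a loss function $L$, a regularization function $R$, and a parameter $\lambda>0$, the regularized empirical risk minimization reads
\[
\min_{f\in\mathcal{B}} \; \frac{1}{m}\sum_{i=1}^{m} L\bigl(\nu_i(f),\,y_i\bigr) \;+\; \lambda\, R\bigl(\|f\|_{\mathcal{B}}\bigr).
\]
This is a sketch consistent with the description above, not the article's exact statement.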
We introduce a new approach to quantum linear algebra based on quantum subspace states and present three new quantum machine learning algorithms. The first is a quantum determinant sampling algorithm that samples from the distribution $\Pr[S]= \det(X_{S}X_{S}^{T})$ for $|S|=d$ using $O(nd)$ gates and circuit depth $O(d\log n)$. The state-of-the-art classical algorithm for this task requires $O(d^{3})$ operations \cite{derezinski2019minimax}. The second is a quantum singular value estimation algorithm for compound matrices $\mathcal{A}^{k}$, for which the speedup is potentially exponential. It decomposes a $\binom{n}{k}$-dimensional vector of order-$k$ correlations into a linear combination of subspace states corresponding to $k$-tuples of singular vectors of $A$. The third algorithm reduces the depth of circuits used in quantum topological data analysis exponentially, from $O(n)$ to $O(\log n)$. Our basic tool is the quantum subspace state, defined as $|Col(X)\rangle = \sum_{S\subset [n], |S|=d} \det(X_{S}) |S\rangle$ for matrices $X \in \mathbb{R}^{n \times d}$ with $X^{T} X = I_{d}$, which encodes a $d$-dimensional subspace of $\mathbb{R}^{n}$. We develop two efficient state preparation techniques: the first uses Givens circuits and the representation of a subspace as a sequence of Givens rotations, while the second uses efficient implementations of the unitaries $\Gamma(x) = \sum_{i} x_{i} Z^{\otimes (i-1)} \otimes X \otimes I^{\otimes (n-i)}$ with $O(\log n)$-depth circuits that we term Clifford loaders.
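As a point of reference for the distribution being sampled, the following brute-force classical sketch (our own; function names are ours, and the enumeration is exponential in $n$) lists all size-$d$ subsets and uses the fact that $\det(X_S X_S^T)=\det(X_S)^2$ sums to $\det(X^TX)=1$ by the Cauchy-Binet formula when $X^TX=I_d$:

```python
# Brute-force classical reference for the determinantal distribution
# Pr[S] = det(X_S X_S^T) over subsets S with |S| = d, for a
# column-orthonormal X (X^T X = I_d).  Sanity-check sketch only; the
# quantum algorithm above samples this distribution with O(nd) gates.
import itertools
import numpy as np

def determinant_distribution(X):
    n, d = X.shape
    subsets = list(itertools.combinations(range(n), d))
    # det(X_S X_S^T) = det(X_S)^2 because X_S is a d x d submatrix of rows
    probs = np.array([np.linalg.det(X[list(S), :]) ** 2 for S in subsets])
    # By Cauchy-Binet the probabilities sum to det(X^T X) = 1
    return subsets, probs

def sample_subset(X, rng):
    subsets, probs = determinant_distribution(X)
    idx = rng.choice(len(subsets), p=probs / probs.sum())
    return subsets[idx]

# Example: random column-orthonormal X via QR
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((6, 3)))
print(sample_subset(X, rng))
```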
In recent years, predict-and-optimize approaches, also known as decision-focused learning, have received increasing attention. In this setting, the predictions of machine learning models are used as estimated cost coefficients in the objective function of a discrete combinatorial optimization problem for decision making. Predict-and-optimize approaches propose to train the ML models, often neural networks, by directly optimizing the quality of the decisions made by the optimization solvers. Building on recent work that proposed a Noise Contrastive Estimation loss over a subset of the solution space, we observe that predict-and-optimize can more generally be seen as a learning-to-rank problem: the goal is to learn an objective function that ranks the feasible points correctly. This approach is independent of the optimization method used and of the form of the objective function. We develop pointwise, pairwise and listwise ranking loss functions, which can be differentiated in closed form given a subset of solutions. We empirically compare our generic method with existing predict-and-optimize approaches and obtain competitive results. Furthermore, controlling the subset of solutions allows considerable control over the runtime, with limited effect on regret.
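As an illustration of the learning-to-rank view, here is a minimal sketch (our own; the function name and the margin hyperparameter are not from the paper) of a pairwise ranking loss over a cached subset of feasible solutions. It is piecewise linear in the predicted cost vector, hence differentiable in closed form almost everywhere:

```python
# Hedged sketch of a pairwise ranking loss over a cached subset of feasible
# solutions, in the spirit of the learning-to-rank view described above.
import numpy as np

def pairwise_hinge_loss(c_hat, solutions, c_true, margin=0.1):
    """c_hat: predicted cost vector (the ML model's output).
    solutions: array of shape (k, n), a subset of feasible points.
    c_true: true cost vector used to order the subset (minimization).
    Penalizes pairs where a truly better solution does not obtain a
    predicted objective lower by at least `margin`."""
    true_obj = solutions @ c_true        # ground-truth objective values
    pred_obj = solutions @ c_hat         # predicted objective values
    order = np.argsort(true_obj)         # best (lowest) first
    loss = 0.0
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            better, worse = order[a], order[b]
            # hinge: want pred_obj[better] + margin <= pred_obj[worse]
            loss += max(0.0, margin + pred_obj[better] - pred_obj[worse])
    return loss
```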
We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean-square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a "martingale loss function", whose solution is proved to be the best approximation of the true value function in the mean-square sense. This method interprets the classical gradient Monte Carlo algorithm. The second method is based on a system of equations called the "martingale orthogonality conditions" with test functions. Solving these equations in different ways recovers various classical TD algorithms, such as TD($\lambda$), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero, and we provide the convergence rate. We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.
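To make the two constructions concrete, here is a schematic version consistent with the description above (the symbols are ours, not the paper's exact statement; $r$ denotes the running reward, $X$ the state process, and $V^{\theta}(T,\cdot)$ is assumed to match the terminal reward). Define the process
\[
M^{\theta}_t \;=\; V^{\theta}(t, X_t) + \int_0^t r(s, X_s)\,\mathrm{d}s, \qquad
\mathrm{ML}(\theta) \;=\; \tfrac{1}{2}\,\mathbb{E}\!\left[\int_0^T \bigl(M^{\theta}_T - M^{\theta}_t\bigr)^2\,\mathrm{d}t\right],
\]
where $\mathrm{ML}$ is a martingale loss penalizing deviations of $M^{\theta}$ from being a martingale, while the martingale orthogonality conditions take the form
\[
\mathbb{E}\!\left[\int_0^T \xi_t\,\mathrm{d}M^{\theta}_t\right] \;=\; 0 \quad\text{for all test functions } \xi .
\]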
Many problems in computational science and engineering can be described in terms of approximating a smooth function of $d$ variables, defined over an unknown domain of interest $\Omega\subset \mathbb{R}^d$, from sample data. Here both the curse of dimensionality ($d\gg 1$) and the lack of domain knowledge, with $\Omega$ potentially irregular and/or disconnected, are confounding factors for sampling-based methods. Na\"{i}ve approaches often lead to wasted samples and inefficient approximation schemes. For example, uniform sampling can result in upwards of 20\% wasted samples in some problems. In surrogate model construction for computational uncertainty quantification (UQ), the high cost of computing samples calls for a more efficient sampling procedure. In recent years, methods for computing such approximations from sample data have been studied in the case of irregular domains, and the advantages of computing sampling measures that depend on an approximation space $P$ with $\dim(P)=N$ have been shown. In particular, such methods confer stability and well-conditioning, with sample complexity $\mathcal{O}(N\log(N))$. The recently proposed adaptive sampling for general domains (ASGD) strategy is one method for constructing these sampling measures. The main contribution of this paper is to improve ASGD by adaptively updating the sampling measures over unknown domains. We achieve this by first introducing a general domain adaptivity strategy (GDAS), which approximates the function and the domain of interest from sample points. Second, we propose adaptive sampling for unknown domains (ASUD), which generates sampling measures over a domain that may not be known in advance. We then derive least-squares techniques for polynomial approximation on unknown domains. Numerical results show that the ASUD approach can reduce the computational cost by as much as 50\% compared with uniform sampling.
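To illustrate the wasted-samples issue and the basic discrete least-squares step, here is a naive uniform-sampling baseline of our own (not the paper's ASGD/ASUD measures): the domain is revealed only pointwise, so candidate samples falling outside $\Omega$ are wasted, and the polynomial fit uses only the remaining points:

```python
# Minimal sketch: least-squares polynomial fitting when the domain is only
# revealed pointwise, i.e. a candidate point either lies in Omega (usable)
# or not (a wasted sample).
import numpy as np

def fit_on_unknown_domain(f, in_domain, degree=5, n_candidates=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n_candidates)      # uniform candidate samples
    mask = in_domain(x)                            # pointwise domain oracle
    xs, ys = x[mask], f(x[mask])                   # keep only in-domain samples
    # Legendre Vandermonde matrix: approximation space P with dim(P) = degree + 1
    V = np.polynomial.legendre.legvander(xs, degree)
    coef, *_ = np.linalg.lstsq(V, ys, rcond=None)  # discrete least squares
    wasted = 1.0 - mask.mean()                     # fraction of wasted samples
    return coef, wasted

# Example: Omega = [-1, -0.2] U [0.3, 1] (disconnected), f smooth on Omega
coef, wasted = fit_on_unknown_domain(
    f=np.cos, in_domain=lambda x: (x < -0.2) | (x > 0.3))
print(f"wasted fraction of samples: {wasted:.2f}")
```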
In this manuscript, we offer a gentle review of submodularity and supermodularity and their properties. We offer a plethora of submodular definitions; a full description of a number of example submodular functions and their generalizations; example discrete constraints; a discussion of basic algorithms for maximization, minimization, and other operations; a brief overview of continuous submodular extensions; and some historical applications. We then turn to how submodularity is useful in machine learning and artificial intelligence. This includes summarization, and we offer a complete account of the differences between and commonalities amongst sketching, coresets, extractive and abstractive summarization in NLP, data distillation and condensation, and data subset selection and feature selection. We discuss a variety of ways to produce a submodular function useful for machine learning, including heuristic hand-crafting, learning or approximately learning a submodular function or aspects thereof, and some advantages of the use of a submodular function as a coreset producer. We discuss submodular combinatorial information functions, and how submodularity is useful for clustering, data partitioning, parallel machine learning, active and semi-supervised learning, probabilistic modeling, and structured norms and loss functions.
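As one example of the basic maximization algorithms mentioned above, here is the standard greedy method for monotone submodular maximization under a cardinality constraint (a textbook algorithm with the familiar $1-1/e$ guarantee, included as a generic illustration and not as code from the manuscript):

```python
# Greedy algorithm for maximizing a monotone submodular function f
# under a cardinality constraint |S| <= k (Nemhauser-Wolsey-Fisher).
def greedy_submodular_max(f, ground_set, k):
    """f: callable taking a frozenset and returning its value.
    Greedily add the element with the largest marginal gain."""
    S = frozenset()
    for _ in range(k):
        best_gain, best_elem = 0.0, None
        for e in ground_set - S:
            gain = f(S | {e}) - f(S)
            if gain > best_gain:
                best_gain, best_elem = gain, e
        if best_elem is None:          # no positive marginal gain remains
            break
        S = S | {best_elem}
    return S

# Example: coverage function (monotone submodular), as in summarization
coverage = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}
f = lambda S: len(set().union(*(coverage[i] for i in S))) if S else 0
print(greedy_submodular_max(f, frozenset(coverage), k=2))   # picks {3}, then {1}
```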
From the sampling of data to the initialisation of parameters, randomness is ubiquitous in modern machine learning practice. Understanding the statistical fluctuations engendered by the different sources of randomness in prediction is therefore key to understanding robust generalisation. In this manuscript we develop a quantitative and rigorous theory for the study of fluctuations in an ensemble of generalised linear models trained on different, but correlated, features in high dimensions. In particular, we provide a complete description of the asymptotic joint distribution of the empirical risk minimiser for generic convex loss and regularisation in the high-dimensional limit. Our result encompasses a rich set of classification and regression tasks, such as the lazy regime of overparametrised neural networks, or equivalently the random features approximation of kernels. While allowing a direct study of the mitigating effect of ensembling (or bagging) on the bias-variance decomposition of the test error, our analysis also helps disentangle the contribution of statistical fluctuations and the singular role played by the interpolation threshold, which lie at the root of the "double-descent" phenomenon.
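A toy numerical illustration (ours) of the object being analysed: an ensemble of ridge predictors, each trained on its own random-features map of the same data, whose predictions are averaged. This only sets up the estimator; the paper's contribution, the exact asymptotic joint distribution of such estimators, is not computed here:

```python
# Ensembling (bagging) of generalised linear predictors on random features.
import numpy as np

rng = np.random.default_rng(0)
n, d, p, K, lam = 300, 50, 100, 10, 1e-2
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d) / np.sqrt(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)
X_test = rng.standard_normal((200, d))
y_test = X_test @ w_star

preds = []
for _ in range(K):
    F = rng.standard_normal((d, p)) / np.sqrt(d)       # random projection
    Z, Z_test = np.tanh(X @ F), np.tanh(X_test @ F)    # random features
    a = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)   # ridge fit
    preds.append(Z_test @ a)

single_err = np.mean((preds[0] - y_test) ** 2)
ensemble_err = np.mean((np.mean(preds, axis=0) - y_test) ** 2)
print(f"single: {single_err:.3f}  ensemble of {K}: {ensemble_err:.3f}")
```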
Multi-view learning is frequently used in data science. Pairwise correlation maximization is a classical approach for exploring the consensus among multiple views. Since pairwise correlation is defined between two views, its extensions to more views can take many forms, and the intrinsic interconnections among the views are generally lost. To address this issue, we propose to maximize higher-order correlations. This can be formulated as a low-rank approximation problem for the higher-order correlation tensor of the multi-view data. We use the generating polynomial method to solve the low-rank approximation problem. Numerical results on real multi-view data demonstrate that this method consistently outperforms existing methods.
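To make the formulation concrete, the following sketch (ours) forms the empirical third-order correlation tensor of three views and computes a rank-$R$ CP approximation by plain alternating least squares; note that the paper instead solves the low-rank problem with the generating polynomial method:

```python
# Order-3 correlation tensor of three views and a rank-R CP approximation
# via standard alternating least squares (ALS), for illustration only.
import numpy as np

def correlation_tensor(V1, V2, V3):
    """Views V1, V2, V3: arrays of shape (n_samples, d_i), centered."""
    n = V1.shape[0]
    return np.einsum('si,sj,sk->ijk', V1, V2, V3) / n

def khatri_rao(B, C):
    return np.einsum('jr,kr->jkr', B, C).reshape(-1, B.shape[1])

def cp_als(T, R, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    d1, d2, d3 = T.shape
    A, B, C = (rng.standard_normal((d, R)) for d in (d1, d2, d3))
    for _ in range(n_iter):
        A = T.reshape(d1, -1) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = T.transpose(1, 0, 2).reshape(d2, -1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = T.transpose(2, 0, 1).reshape(d3, -1) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Example with three random (centered) views of 50 samples
rng = np.random.default_rng(1)
views = [v - v.mean(axis=0) for v in
         (rng.standard_normal((50, d)) for d in (4, 5, 6))]
A, B, C = cp_als(correlation_tensor(*views), R=3)
```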
Matter evolved under the influence of gravity from minuscule density fluctuations. Non-perturbative structure formed hierarchically over all scales and developed non-Gaussian features in the Universe, known as the Cosmic Web. Fully understanding the structure formation of the Universe is one of the holy grails of modern astrophysics. Astrophysicists survey large volumes of the Universe and employ a large ensemble of computer simulations to compare with the observed data in order to extract the full information about our own Universe. However, evolving trillions of galaxies over billions of years even with the simplest physics is a daunting task. We build a deep neural network, the Deep Density Displacement Model (hereafter D$^3$M), to predict the non-linear structure formation of the Universe from simple linear perturbation theory. Our extensive analysis demonstrates that D$^3$M outperforms second-order perturbation theory (hereafter 2LPT), the commonly used fast approximate simulation method, in point-wise comparison, 2-point correlation, and 3-point correlation. We also show that D$^3$M is able to accurately extrapolate far beyond its training data and predict structure formation for significantly different cosmological parameters. Our study proves, for the first time, that deep learning is a practical and accurate alternative to approximate simulations of the gravitational structure formation of the Universe.
Importance sampling is one of the most widely used variance reduction strategies in Monte Carlo rendering. In this paper, we propose a novel importance sampling technique that uses a neural network to learn how to sample from a desired density represented by a set of samples. Our approach considers an existing Monte Carlo rendering algorithm as a black box. During a scene-dependent training phase, we learn to generate samples with a desired density in the primary sample space of the rendering algorithm using maximum likelihood estimation. We leverage a recent neural network architecture that was designed to represent real-valued non-volume preserving ('Real NVP') transformations in high dimensional spaces. We use Real NVP to non-linearly warp primary sample space and obtain desired densities. In addition, Real NVP efficiently computes the determinant of the Jacobian of the warp, which is required to implement the change of integration variables implied by the warp. A main advantage of our approach is that it is agnostic to the underlying light transport effects, and can be combined with many existing rendering techniques by treating them as a black box. We show that our approach leads to effective variance reduction in several practical scenarios.
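A minimal numpy sketch (ours) of a single Real NVP affine coupling layer, with toy stand-ins for the learned scale and translation networks, illustrating the two properties relied on above: the warp is invertible, and the log-determinant of its Jacobian is cheap to evaluate:

```python
# One Real NVP affine coupling transform and its log-det Jacobian.
# The scale/translation nets s() and t() are toy stand-ins, not trained models.
import numpy as np

def coupling_forward(u, s, t, mask):
    """u: batch of points, shape (B, D).  mask: 0/1 vector of length D.
    Identity on masked coordinates; affine transform on the rest, with
    parameters produced from the masked part only, so the Jacobian is
    triangular and its log-determinant is a simple sum of log-scales."""
    u_fixed = u * mask
    scale = s(u_fixed) * (1 - mask)          # log-scales for free coordinates
    shift = t(u_fixed) * (1 - mask)
    y = u_fixed + (1 - mask) * (u * np.exp(scale) + shift)
    log_det_jacobian = scale.sum(axis=1)
    return y, log_det_jacobian

# Toy stand-ins for the neural networks s and t
D = 4
W_s, W_t = np.random.default_rng(0).standard_normal((2, D, D)) * 0.1
s = lambda x: np.tanh(x @ W_s)
t = lambda x: x @ W_t
mask = np.array([1.0, 1.0, 0.0, 0.0])        # first half fixed, second warped

u = np.random.default_rng(1).standard_normal((5, D))
y, log_det = coupling_forward(u, s, t, mask)
```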
Deep learning (DL) is a high-dimensional data reduction technique for constructing high-dimensional predictors in input-output models. DL is a form of machine learning that uses hierarchical layers of latent features. In this article, we review the state of the art of deep learning from a modeling and algorithmic perspective. We provide a list of successful areas of application in Artificial Intelligence (AI), Image Processing, Robotics and Automation. Deep learning is predictive in nature rather than inferential and can be viewed as a black-box methodology for high-dimensional function estimation.