We consider expected risk minimization when the range of the estimator is required to be nonnegative, motivated by the settings of maximum likelihood estimation (MLE) and trajectory optimization. To facilitate nonlinear interpolation, we hypothesize that search is conducted over a Reproducing Kernel Hilbert Space (RKHS). To solve it, we develop first and second-order variants of stochastic mirror descent employing (i) pseudo-gradients and (ii) complexity-reducing projections. Compressive projection in first-order scheme is executed via kernel orthogonal matching pursuit (KOMP), and overcome the fact that the vanilla RKHS parameterization grows unbounded with time. Moreover, pseudo-gradients are needed when stochastic estimates of the gradient of the expected cost are only computable up to some numerical errors, which arise in, e.g., integral approximations. The second-order scheme develops a Hessian inverse approximation via recursively averaged pseudo-gradient outer products. For the first-order scheme, we establish tradeoffs between accuracy of convergence in mean and the projection budget parameter under constant step-size and compression budget are established, as well as non-asymptotic bounds on the model complexity. Analogous convergence results are established for the second-order scheme under an additional eigenvalue decay condition on the Hessian of the optimal RKHS element. Experiments demonstrate favorable performance on inhomogeneous Poisson Process intensity estimation in practice.
This paper presents a comparison of two multi-fidelity methods for the forward uncertainty quantification of a naval engineering problem. Specifically, we consider the problem of quantifying the uncertainty of the hydrodynamic resistance of a roll-on/roll-off passengers ferry advancing in calm water and subject to two operational uncertainties (ship speed and payload). The first four statistical moments (mean, variance, skewness, kurtosis), and the probability density function for such quantity of interest (QoI) are computed with two multi-fidelity methods, i.e., the Multi-Index Stochastic Collocation (MISC) method and an adaptive multi-fidelity Stochastic Radial Basis Functions (SRBF) algorithm. The QoI is evaluated via computational fluid dynamics simulations, which are performed with the in-house unsteady Reynolds-Averaged Navier-Stokes (RANS) multi-grid solver $\chi$navis. The different fidelities employed by both methods are obtained by stopping the RANS solver at different grid levels of the multi-grid cycle. The performance of both methods are presented and discussed: in a nutshell, the findings suggest that, at least for the current implementations of both algorithms, MISC could be preferred whenever a limited computational budget is available, whereas for a larger computational budget SRBFs seem to be preferable, thanks to its robustness to the numerical noise in the evaluations of the QoI.
We study the convergence properties of gradient descent for training deep linear neural networks, i.e., deep matrix factorizations, by extending a previous analysis for the related gradient flow. We show that under suitable conditions on the step sizes gradient descent converges to a critical point of the loss function, i.e., the square loss in this article. Furthermore, we demonstrate that for almost all initializations gradient descent converges to a global minimum in the case of two layers. In the case of three or more layers we show that gradient descent converges to a global minimum on the manifold matrices of some fixed rank, where the rank cannot be determined a priori.
Let $P$ be a linear differential operator over $\mathcal{D} \subset \mathbb{R}^d$ and $U = (U_x)_{x \in \mathcal{D}}$ a second order stochastic process. In the first part of this article, we prove a new simple necessary and sufficient condition for all the trajectories of $U$ to verify the partial differential equation (PDE) $T(U) = 0$. This condition is formulated in terms of the covariance kernel of $U$. The novelty of this result is that the equality $T(U) = 0$ is understood in the sense of distributions, which is a functional analysis framework particularly adapted to the study of PDEs. This theorem provides precious insights during the second part of this article, which is dedicated to performing "physically informed" machine learning on data that is solution to the homogeneous 3 dimensional free space wave equation. We perform Gaussian Process Regression (GPR) on this data, which is a kernel based machine learning technique. To do so, we model the solution of this PDE as a trajectory drawn from a well-chosen Gaussian process (GP). We obtain explicit formulas for the covariance kernel of the corresponding stochastic process; this kernel can then be used for GPR. We explore two particular cases : the radial symmetry and the point source. In the case of radial symmetry, we derive "fast to compute" GPR formulas; in the case of the point source, we show a direct link between GPR and the classical triangulation method for point source localization used e.g. in GPS systems. We also show that this use of GPR can be interpreted as a new answer to the ill-posed inverse problem of reconstructing initial conditions for the wave equation with finite dimensional data, and also provides a way of estimating physical parameters from this data as in [Raissi et al,2017]. We finish by showcasing this physically informed GPR on a number of practical examples.
We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective formulation does not change the location of stationary points compared to the original optimization problem; (ii) we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (i.e. to the average of their parameters); (iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient since it broadcasts only parameters of the leader rather than all workers. We provide theoretical analysis of the batch version of the proposed algorithm, which we call Leader Gradient Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an asynchronous version of our algorithm and extend it to the multi-leader setting, where we form groups of workers, each represented by its own local leader (the best performer in a group), and update each worker with a corrective direction comprised of two attractive forces: one to the local, and one to the global leader (the best performer among all workers). The multi-leader setting is well-aligned with current hardware architecture, where local workers forming a group lie within a single computational node and different groups correspond to different nodes. For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines.
In gradient descent, changing how we parametrize the model can lead to drastically different optimization trajectories, giving rise to a surprising range of meaningful inductive biases: identifying sparse classifiers or reconstructing low-rank matrices without explicit regularization. This implicit regularization has been hypothesised to be a contributing factor to good generalization in deep learning. However, natural gradient descent is approximately invariant to reparameterization, it always follows the same trajectory and finds the same optimum. The question naturally arises: What happens if we eliminate the role of parameterization, which solution will be found, what new properties occur? We characterize the behaviour of natural gradient flow in deep linear networks for separable classification under logistic loss and deep matrix factorization. Some of our findings extend to nonlinear neural networks with sufficient but finite over-parametrization. We demonstrate that there exist learning problems where natural gradient descent fails to generalize, while gradient descent with the right architecture performs well.
This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activiation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centering around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization of deep learning, and pave the way to study the optimization dynamics of training modern deep neural networks.
Image foreground extraction is a classical problem in image processing and vision, with a large range of applications. In this dissertation, we focus on the extraction of text and graphics in mixed-content images, and design novel approaches for various aspects of this problem. We first propose a sparse decomposition framework, which models the background by a subspace containing smooth basis vectors, and foreground as a sparse and connected component. We then formulate an optimization framework to solve this problem, by adding suitable regularizations to the cost function to promote the desired characteristics of each component. We present two techniques to solve the proposed optimization problem, one based on alternating direction method of multipliers (ADMM), and the other one based on robust regression. Promising results are obtained for screen content image segmentation using the proposed algorithm. We then propose a robust subspace learning algorithm for the representation of the background component using training images that could contain both background and foreground components, as well as noise. With the learnt subspace for the background, we can further improve the segmentation results, compared to using a fixed subspace. Lastly, we investigate a different class of signal/image decomposition problem, where only one signal component is active at each signal element. In this case, besides estimating each component, we need to find their supports, which can be specified by a binary mask. We propose a mixed-integer programming problem, that jointly estimates the two components and their supports through an alternating optimization scheme. We show the application of this algorithm on various problems, including image segmentation, video motion segmentation, and also separation of text from textured images.
Methods that align distributions by minimizing an adversarial distance between them have recently achieved impressive results. However, these approaches are difficult to optimize with gradient descent and they often do not converge well without careful hyperparameter tuning and proper initialization. We investigate whether turning the adversarial min-max problem into an optimization problem by replacing the maximization part with its dual improves the quality of the resulting alignment and explore its connections to Maximum Mean Discrepancy. Our empirical results suggest that using the dual formulation for the restricted family of linear discriminators results in a more stable convergence to a desirable solution when compared with the performance of a primal min-max GAN-like objective and an MMD objective under the same restrictions. We test our hypothesis on the problem of aligning two synthetic point clouds on a plane and on a real-image domain adaptation problem on digits. In both cases, the dual formulation yields an iterative procedure that gives more stable and monotonic improvement over time.