Approximating a function with a finite series, e.g., involving polynomials or trigonometric functions, is a critical tool in computing and data analysis. The construction of such approximations via now-standard approaches like least squares or compressive sampling does not ensure that the approximation adheres to certain convex linear structural constraints, such as positivity or monotonicity. Existing approaches that ensure such structure are norm-dissipative, and this can have a deleterious impact, e.g., when numerically solving partial differential equations. We present a new framework that enforces such structure on approximations via optimization while simultaneously preserving the norm. This results in a conceptually simple convex optimization problem on the sphere, but the feasible set for such problems can be very complex. We establish well-posedness of the optimization problem through results on spherical convexity and design several spherical-projection-based algorithms to numerically compute the solution. Finally, we demonstrate the effectiveness of this approach through several numerical examples.
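As a rough illustration of the kind of constrained, norm-preserving approximation described above (and not the paper's spherical-projection algorithm), the following sketch alternates an approximate projection onto a positivity constraint with a projection onto the unit sphere; the target function, polynomial basis, grid, and iteration count are all assumptions.
\begin{verbatim}
import numpy as np

# Rough illustration (not the paper's spherical-projection algorithm): fit a
# polynomial by least squares, then alternate an approximate projection onto
# positivity (on a sample grid) with a projection onto the unit sphere.
f = lambda x: 0.1 + np.abs(np.sin(4.0 * x))     # target function (assumed)
x = np.linspace(0.0, 1.0, 200)
V = np.vander(x, 8, increasing=True)            # polynomial basis on the grid
c = np.linalg.lstsq(V, f(x), rcond=None)[0]     # unconstrained least squares
c /= np.linalg.norm(c)                          # start on the unit sphere

for _ in range(500):
    vals = V @ c
    # approximate projection onto positivity: refit the clipped values
    c = np.linalg.lstsq(V, np.maximum(vals, 0.0), rcond=None)[0]
    c /= np.linalg.norm(c)                      # project back onto the sphere

print("minimum of the constrained approximant:", float((V @ c).min()))
\end{verbatim}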
Numerical models of weather and climate critically depend on the long-term stability of integrators for systems of hyperbolic conservation laws. While such stability is often obtained from (physical or numerical) dissipation terms, the physical fidelity of such simulations also depends on properly preserving conserved quantities of the system, such as energy. To address this apparent paradox, we develop a variational integrator for the shallow water equations that conserves energy but dissipates potential enstrophy. Our approach follows the continuous selective decay framework [F. Gay-Balmaz and D. Holm. Selective decay by Casimir dissipation in inviscid fluids. Nonlinearity, 26(2):495, 2013], which enables dissipating an otherwise conserved quantity while conserving the total energy. We use this in combination with the variational discretization method [D. Pavlov, P. Mullen, Y. Tong, E. Kanso, J. Marsden and M. Desbrun. Structure-preserving discretization of incompressible fluids. Physica D: Nonlinear Phenomena, 240(6):443-458, 2011] to obtain a discrete selective decay framework. This is applied to the shallow water equations, both in the plane and on the sphere, to dissipate the potential enstrophy. The resulting scheme significantly improves the quality of the approximate solutions, enabling long-term integrations to be carried out.
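Schematically, and with notation that is only suggestive of the paper's discrete formulation, selective decay asks the dynamics to satisfy
\[
\frac{\mathrm{d}E}{\mathrm{d}t} = 0,
\qquad
\frac{\mathrm{d}C}{\mathrm{d}t} = -\theta\, D[u] \le 0,
\qquad \theta > 0,\; D[u] \ge 0,
\]
where $E$ is the total energy, $C$ is the Casimir being dissipated (here the potential enstrophy), and $D$ is a nonnegative dissipation functional induced by the Casimir-dissipation bracket.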
We introduce a physically relevant stochastic representation of the rotating shallow water equations. The derivation relies mainly on a stochastic transport principle and on a decomposition of the fluid flow into a large-scale component and a noise term that models the unresolved flow components. As for the classical (deterministic) system, this scheme, referred to as modelling under location uncertainty (LU), conserves the global energy of any realization and makes it possible to generate an ensemble of physically relevant random simulations with a good trade-off between the model error representation and the ensemble's spread. To maintain the energy conservation property numerically, we combine an energy-preserving (in space) discretization of the underlying deterministic model with approximations of the stochastic terms based on standard finite volume/difference operators. The LU derivation, built from the very same conservation principles as the usual geophysical models, together with the proposed numerical scheme, can be directly used in existing dynamical cores of global numerical weather prediction models. The capabilities of the proposed framework are demonstrated for an inviscid test case on the f-plane and for a barotropically unstable jet on the sphere.
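The flow decomposition underlying the LU framework can be written, in standard (assumed) notation, as
\[
\mathrm{d}\mathbf{X}_t = \mathbf{u}(\mathbf{X}_t,t)\,\mathrm{d}t + \boldsymbol{\sigma}(\mathbf{X}_t,t)\,\mathrm{d}\mathbf{B}_t,
\]
where $\mathbf{u}$ is the large-scale (resolved) velocity and $\boldsymbol{\sigma}\,\mathrm{d}\mathbf{B}_t$ is the noise term modelling the unresolved flow components.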
For a sample of Exponentially distributed durations, we aim at point estimation of, and a confidence interval for, its parameter. A duration is only observed if it has ended within a certain time interval, determined by a Uniform distribution. Hence, the data is a truncated empirical process that we can approximate by a Poisson process when only a small portion of the sample is observed, as is the case for our applications. We derive the likelihood from standard arguments for point processes, acknowledging the size of the latent sample as the second parameter, and obtain the maximum likelihood estimator for both. Consistency and asymptotic normality of the estimator for the Exponential parameter are derived from standard results on M-estimation. We compare the design with a simple random sample assumption for the observed durations. Theoretically, the derivative of the log-likelihood is less steep in the truncation design for small parameter values, indicating a larger computational effort for root finding and a larger standard error. In applications from the social and economic sciences and in simulations, we indeed find a moderately increased standard error when acknowledging truncation.
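To make the estimation problem concrete, the following sketch simulates durations with Uniform observation windows and maximizes a simplified conditional likelihood for right-truncated Exponential data; it does not implement the paper's Poisson-process likelihood (which additionally treats the latent sample size as a parameter), and all constants are assumptions.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

# Simplified illustration (not the paper's Poisson-process likelihood):
# Exponential durations, each with a Uniform observation window; only
# durations that ended within their window are kept.
rng = np.random.default_rng(0)
lam_true, n = 0.5, 5000
c = rng.uniform(0.0, 2.0, size=n)               # observation windows
t = rng.exponential(1.0 / lam_true, size=n)     # latent durations
obs, win = t[t <= c], c[t <= c]                 # observed (truncated) durations

def neg_loglik(lam):
    # density of an Exponential duration conditional on ending within its window
    return -np.sum(np.log(lam) - lam * obs - np.log1p(-np.exp(-lam * win)))

lam_hat = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded").x
print("estimate of the Exponential parameter:", lam_hat)
\end{verbatim}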
We present a new algorithmic framework for grouped variable selection that is based on discrete mathematical optimization. While there exist several appealing approaches based on convex relaxations and nonconvex heuristics, we focus on optimal solutions for the $\ell_0$-regularized formulation, a problem that is relatively unexplored due to computational challenges. Our methodology covers both high-dimensional linear regression and nonparametric sparse additive modeling with smooth components. Our algorithmic framework consists of approximate and exact algorithms. The approximate algorithms are based on coordinate descent and local search, with runtimes comparable to popular sparse learning algorithms. Our exact algorithm is based on a standalone branch-and-bound (BnB) framework, which can solve the associated mixed integer programming (MIP) problem to certified optimality. By exploiting the problem structure, our custom BnB algorithm can solve problem instances with $5 \times 10^6$ features and $10^3$ observations to optimality in minutes to hours -- over $1000$ times larger than what is currently possible using state-of-the-art commercial MIP solvers. We also explore statistical properties of the $\ell_0$-based estimators. We demonstrate, theoretically and empirically, that our proposed estimators have an edge over popular group-sparse estimators in terms of statistical performance in various regimes. We provide an open-source implementation of our proposed framework.
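To give a feel for the group-$\ell_0$ objective, here is a minimal proximal-gradient (iterative group hard-thresholding) sketch for group-$\ell_0$-regularized least squares; it is only illustrative and is not the paper's coordinate descent, local search, or BnB algorithm, and the data, group structure, and penalty level are assumptions.
\begin{verbatim}
import numpy as np

# Minimal proximal-gradient (group hard-thresholding) sketch for
# group-l0-regularized least squares; illustrative only.
rng = np.random.default_rng(1)
n, p, g = 200, 50, 10                           # samples, features, group size
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:g] = 1.0                             # only the first group is active
y = X @ beta_true + 0.1 * rng.standard_normal(n)

groups = np.split(np.arange(p), p // g)         # contiguous, equal-sized groups
lam = 10.0                                      # l0 penalty per active group
L = np.linalg.norm(X, 2) ** 2                   # Lipschitz constant of the gradient
beta = np.zeros(p)

for _ in range(300):
    z = beta - X.T @ (X @ beta - y) / L         # gradient step on the LS term
    for idx in groups:                          # group hard-thresholding prox
        beta[idx] = z[idx] if z[idx] @ z[idx] > 2.0 * lam / L else 0.0

print("selected groups:", [i for i, idx in enumerate(groups) if np.any(beta[idx])])
\end{verbatim}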
We study the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss, where labels are generated by a target network. We make use of the rich symmetry structure to develop a novel set of tools for studying families of spurious minima. In contrast to existing approaches which operate in limiting regimes, our technique directly addresses the nonconvex loss landscape for a finite number of inputs $d$ and neurons $k$, and provides analytic, rather than heuristic, information. In particular, we derive analytic estimates for the loss at different minima, and prove that modulo $O(d^{-1/2})$-terms the Hessian spectrum concentrates near small positive constants, with the exception of $\Theta(d)$ eigenvalues which grow linearly with~$d$. We further show that the Hessian spectra at global and spurious minima coincide to $O(d^{-1/2})$-order, thus challenging our ability to argue about statistical generalization through local curvature. Lastly, our technique provides the exact \emph{fractional} dimensionality at which families of critical points turn from saddles into spurious minima. This makes possible the study of the creation and the annihilation of spurious minima using powerful tools from equivariant bifurcation theory.
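In this student-teacher setting the objective takes the form (up to the paper's normalization and choice of input distribution, which are not specified here)
\[
\mathcal{L}(W) = \tfrac{1}{2}\,\mathbb{E}_{x}\Big[\Big(\sum_{i=1}^{k}\max\{w_i^{\top}x,0\} - \sum_{i=1}^{k}\max\{v_i^{\top}x,0\}\Big)^{2}\Big],
\]
where the expectation (or finite average) is over inputs $x \in \mathbb{R}^d$, $v_1,\dots,v_k$ are the weights of the target network, and $w_1,\dots,w_k$ those of the trained network.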
Quasi-Newton methods generally provide curvature information by approximating the Hessian using the secant equation. However, the secant equation becomes less effective at approximating the Newton step, owing to its reliance on first-order derivatives. In this study, we propose an approximate Newton step-based stochastic optimization algorithm for large-scale empirical risk minimization of convex functions with linear convergence rates. Specifically, we compute a partial column Hessian of size ($d\times k$) with $k\ll d$ randomly selected variables, then use the \textit{Nystr\"om method} to better approximate the full Hessian matrix. To further reduce the computational complexity per iteration, we directly compute the update step ($\Delta\boldsymbol{w}$) without computing and storing the full Hessian or its inverse. Furthermore, to address large-scale scenarios in which even computing a partial Hessian may require significant time, we use distribution-preserving (DP) sub-sampling to compute a partial Hessian. The DP sub-sampling generates $p$ sub-samples with similar first- and second-order distribution statistics and selects a single sub-sample at each epoch in a round-robin manner to compute the partial Hessian. We integrate our approximated Hessian with stochastic gradient descent and stochastic variance-reduced gradients to solve the logistic regression problem. The numerical experiments show that the proposed approach obtains a better approximation of Newton's method, with performance competitive with state-of-the-art first-order and stochastic quasi-Newton methods.
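The following sketch illustrates one way such a Nystr\"om-type step can be formed for $\ell_2$-regularized logistic regression: $k$ Hessian columns $C = H_{:,\mathcal{I}}$ and the $k\times k$ block $M = H_{\mathcal{I},\mathcal{I}}$ give the approximation $H \approx C M^{-1} C^{\top}$, and a Woodbury identity reduces the damped step computation to a $k\times k$ solve. It is a hedged reconstruction from the abstract, not the authors' code; the damping $\rho$, sizes, and sampling scheme are assumptions.
\begin{verbatim}
import numpy as np

# Hedged reconstruction of a Nystrom-type approximate Newton step for
# l2-regularized logistic regression (not the authors' code).
rng = np.random.default_rng(0)
n, d, k, mu, rho = 2000, 500, 25, 1e-3, 1e-1    # sizes and damping (assumed)
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n).astype(float)
w = np.zeros(d)

p = 1.0 / (1.0 + np.exp(-X @ w))                # predicted probabilities
g = X.T @ (p - y) / n + mu * w                  # gradient
s = p * (1.0 - p) / n                           # per-sample Hessian weights
idx = rng.choice(d, size=k, replace=False)      # randomly selected variables
C = X.T @ (s[:, None] * X[:, idx])              # d x k partial column Hessian
C[idx, np.arange(k)] += mu                      # ridge term on the sampled block
M = C[idx, :]                                   # k x k principal block

# step = -(rho*I + C M^{-1} C^T)^{-1} g, computed via the Woodbury identity
step = -(g - C @ np.linalg.solve(rho * M + C.T @ C, C.T @ g)) / rho
w = w + step                                    # a line search would scale this in practice
print("norm of the approximate Newton step:", np.linalg.norm(step))
\end{verbatim}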
In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach for designing new algorithms in many settings, such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy a structural invariance but also admit a simple expression. Finally, we test the efficiency of our proposed methods on both deterministic non-convex problems and deep learning problems.
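For reference, the generic natural-gradient update on the parameters $\mu$ of a search or variational distribution reads
\[
\mu_{t+1} = \mu_t - \beta\,\mathbf{F}(\mu_t)^{-1}\,\nabla_{\mu}\mathcal{L}(\mu_t),
\]
with step size $\beta$ and Fisher information matrix $\mathbf{F}$; the structured methods of the paper restrict these parameters to structured spaces, a construction whose details are not reproduced here.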
A precision matrix is the inverse of a covariance matrix. In this paper, we study the problem of estimating the precision matrix with a known graphical structure under high-dimensional settings. We propose a simple estimator of the precision matrix based on the connection between the known graphical structure and the precision matrix. We obtain the rates of convergence of the proposed estimator and derive its asymptotic normality in the high-dimensional setting, where the data dimension grows with the sample size. Numerical simulations are conducted to demonstrate the performance of the proposed method. We also show that the proposed method outperforms some existing methods that do not utilize the graphical structure information.
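One standard way to exploit a known graph, shown below purely for illustration (it is not necessarily the paper's estimator), is node-wise regression: for Gaussian data, regressing $X_j$ on its known neighbours gives $\Omega_{jj} = 1/\sigma_j^2$ and $\Omega_{jk} = -\beta_k/\sigma_j^2$, where $\beta$ are the regression coefficients and $\sigma_j^2$ is the residual variance. The toy precision matrix and sample size are assumptions.
\begin{verbatim}
import numpy as np

# Illustration of node-wise regression with a known graph (not necessarily
# the paper's estimator).
rng = np.random.default_rng(0)
d, n = 5, 2000
Omega = np.eye(d)
Omega[0, 1] = Omega[1, 0] = 0.4
Omega[2, 3] = Omega[3, 2] = -0.3
X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Omega), size=n)

nbrs = [np.flatnonzero((Omega[j] != 0) & (np.arange(d) != j)) for j in range(d)]
Omega_hat = np.zeros((d, d))
for j in range(d):
    if nbrs[j].size:
        beta, *_ = np.linalg.lstsq(X[:, nbrs[j]], X[:, j], rcond=None)
        resid = X[:, j] - X[:, nbrs[j]] @ beta
    else:
        resid = X[:, j]
    Omega_hat[j, j] = 1.0 / resid.var()
    if nbrs[j].size:
        Omega_hat[j, nbrs[j]] = -beta / resid.var()
Omega_hat = 0.5 * (Omega_hat + Omega_hat.T)     # symmetrise the node-wise estimates
print(np.round(Omega_hat, 2))
\end{verbatim}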
Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph structure is available. In practice, however, real-world graphs are often noisy and incomplete or might not be available at all. With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.
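The sketch below only illustrates the two sets of variables involved: edge probabilities $\theta$ (the outer, graph-generating variable) from which a discrete graph is sampled, and GCN weights $W$ (the inner variable) applied to the sampled graph; the bilevel optimization loop itself is omitted, and all names and sizes are assumptions.
\begin{verbatim}
import numpy as np

# Sketch of the two sets of variables only (names, sizes, and sampling scheme
# are assumptions; the bilevel optimization loop is omitted).
rng = np.random.default_rng(0)
n_nodes, n_feat, n_hid = 6, 4, 3
X = rng.standard_normal((n_nodes, n_feat))      # node features
theta = np.full((n_nodes, n_nodes), 0.3)        # edge probabilities (outer variable)
np.fill_diagonal(theta, 0.0)
W = 0.1 * rng.standard_normal((n_feat, n_hid))  # GCN weights (inner variable)

A = (rng.random((n_nodes, n_nodes)) < theta).astype(float)
A = np.maximum(A, A.T)                          # sampled undirected graph
A_hat = A + np.eye(n_nodes)                     # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)   # one GCN layer
print("hidden representation shape:", H.shape)
\end{verbatim}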
The Variational Auto-Encoder (VAE) is one of the most widely used unsupervised machine learning models. Although the default choice of a Gaussian distribution for both the prior and posterior is mathematically convenient and often leads to competitive results, we show that this parameterization fails to model data with a latent hyperspherical structure. To address this issue we propose using a von Mises-Fisher (vMF) distribution instead, leading to a hyperspherical latent space. Through a series of experiments we show how such a hyperspherical VAE, or $\mathcal{S}$-VAE, is more suitable for capturing data with a hyperspherical latent structure, while outperforming a normal, $\mathcal{N}$-VAE, in low dimensions on other data types.
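For reference, the von Mises-Fisher density on the unit sphere $\mathcal{S}^{p-1}$ has the standard form (the paper's notation may differ)
\[
f(\mathbf{x};\boldsymbol{\mu},\kappa) = \mathcal{C}_{p}(\kappa)\,\exp\!\big(\kappa\,\boldsymbol{\mu}^{\top}\mathbf{x}\big),
\qquad
\mathcal{C}_{p}(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}\,I_{p/2-1}(\kappa)},
\]
where $\|\boldsymbol{\mu}\| = 1$ is the mean direction, $\kappa \ge 0$ the concentration, and $I_{\nu}$ the modified Bessel function of the first kind.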