We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.
LU and Cholesky matrix factorization algorithms are core subroutines used to solve systems of linear equations (SLEs) encountered while solving an optimization problem. Standard factorization algorithms are highly efficient but remain susceptible to the accumulation of roundoff errors, which can lead solvers to return feasibility and optimality claims that are actually invalid. This paper introduces a novel approach for solving sequences of closely related SLEs encountered in nonlinear programming efficiently and without roundoff errors. Specifically, it introduces rank-one update algorithms for the roundoff-error-free (REF) factorization framework, a toolset built on integer-preserving arithmetic that has led to the development and implementation of fail-proof SLE solution subroutines for linear programming. The formal guarantees of the proposed algorithms are established through the derivation of theoretical insights. Their advantages are supported with computational experiments, which demonstrate upwards of 75x-improvements over exact factorization run-times on fully dense matrices with over one million entries. A significant advantage of the methodology is that the length of any coefficient calculated via the proposed algorithms is bounded polynomially in the size of the inputs without having to resort to greatest common divisor operations, which are required by and thereby hinder an efficient implementation of exact rational arithmetic approaches.
In this paper, we study a sequential decision making problem faced by e-commerce carriers related to when to send out a vehicle from the central depot to serve customer requests, and in which order to provide the service, under the assumption that the time at which parcels arrive at the depot is stochastic and dynamic. The objective is to maximize the number of parcels that can be delivered during the service hours. We propose two reinforcement learning approaches for solving this problem, one based on a policy function approximation (PFA) and the second on a value function approximation (VFA). Both methods are combined with a look-ahead strategy, in which future release dates are sampled in a Monte-Carlo fashion and a tailored batch approach is used to approximate the value of future states. Our PFA and VFA make a good use of branch-and-cut-based exact methods to improve the quality of decisions. We also establish sufficient conditions for partial characterization of optimal policy and integrate them into PFA/VFA. In an empirical study based on 720 benchmark instances, we conduct a competitive analysis using upper bounds with perfect information and we show that PFA and VFA greatly outperform two alternative myopic approaches. Overall, PFA provides best solutions, while VFA (which benefits from a two-stage stochastic optimization model) achieves a better tradeoff between solution quality and computing time.
It remains challenging to deploy existing risk-averse approaches to real-world applications. The reasons are multi-fold, including the lack of global optimality guarantee and the necessity of learning from long-term consecutive trajectories. Long-term consecutive trajectories are prone to involving visiting hazardous states, which is a major concern in the risk-averse setting. This paper proposes Short-Term VOlatility-controlled Policy Search (STOPS), a novel algorithm that solves risk-averse problems by learning from short-term trajectories instead of long-term trajectories. Short-term trajectories are more flexible to generate, and can avoid the danger of hazardous state visitations. By using an actor-critic scheme with an overparameterized two-layer neural network, our algorithm finds a globally optimal policy at a sublinear rate with proximal policy optimization and natural policy gradient, with effectiveness comparable to the state-of-the-art convergence rate of risk-neutral policy-search methods. The algorithm is evaluated on challenging Mujoco robot simulation tasks under the mean-variance evaluation metric. Both theoretical analysis and experimental results demonstrate a state-of-the-art level of STOPS' performance among existing risk-averse policy search methods.
It was observed in \citet{gupta2009differentially} that the Set Cover problem has strong impossibility results under differential privacy. In our work, we observe that these hardness results dissolve when we turn to the Partial Set Cover problem, where we only need to cover a $\rho$-fraction of the elements in the universe, for some $\rho\in(0,1)$. We show that this relaxation enables us to avoid the impossibility results: under loose conditions on the input set system, we give differentially private algorithms which output an explicit set cover with non-trivial approximation guarantees. In particular, this is the first differentially private algorithm which outputs an explicit set cover. Using our algorithm for Partial Set Cover as a subroutine, we give a differentially private (bicriteria) approximation algorithm for a facility location problem which generalizes $k$-center/$k$-supplier with outliers. Like with the Set Cover problem, no algorithm has been able to give non-trivial guarantees for $k$-center/$k$-supplier-type facility location problems due to the high sensitivity and impossibility results. Our algorithm shows that relaxing the covering requirement to serving only a $\rho$-fraction of the population, for $\rho\in(0,1)$, enables us to circumvent the inherent hardness. Overall, our work is an important step in tackling and understanding impossibility results in private combinatorial optimization.
We derive conditions for the existence of fixed points of cone mappings without assuming scalability of functions. Monotonicity and scalability are often inseparable in the literature in the context of searching for fixed points of interference mappings. In applications, such mappings are approximated by non-negative neural networks. It turns out, however, that the process of training non-negative networks requires imposing an artificial constraint on the weights of the model. However, in the case of specific non-negative data, it cannot be said that if the mapping is non-negative, it has only non-negative weights. Therefore, we considered the problem of the existence of fixed points for general neural networks, assuming the conditions of tangency conditions with respect to specific cones. This does not relax the physical assumptions, because even assuming that the input and output are to be non-negative, the weights can have (small, but) less than zero values. Such properties (often found in papers on the interpretability of weights of neural networks) lead to the weakening of the assumptions about the monotonicity or scalability of the mapping associated with the neural network. To the best of our knowledge, this paper is the first to study this phenomenon.
Importance sampling (IS) is valuable in reducing the variance of Monte Carlo sampling for many areas, including finance, rare event simulation, and Bayesian inference. It is natural and obvious to combine quasi-Monte Carlo (QMC) methods with IS to achieve a faster rate of convergence. However, a naive replacement of Monte Carlo with QMC may not work well. This paper investigates the convergence rates of randomized QMC-based IS for estimating integrals with respect to a Gaussian measure, in which the IS measure is a Gaussian or $t$ distribution. We prove that if the target function satisfies the so-called boundary growth condition and the covariance matrix of the IS density has eigenvalues no smaller than 1, then randomized QMC with the Gaussian proposal has a root mean squared error of $O(N^{-1+\epsilon})$ for arbitrarily small $\epsilon>0$. Similar results of $t$ distribution as the proposal are also established. These sufficient conditions help to assess the effectiveness of IS in QMC. For some particular applications, we find that the Laplace IS, a very general approach to approximate the target function by a quadratic Taylor approximation around its mode, has eigenvalues smaller than 1, making the resulting integrand less favorable for QMC. From this point of view, when using Gaussian distributions as the IS proposal, a change of measure via Laplace IS may transform a favorable integrand into unfavorable one for QMC although the variance of Monte Carlo sampling is reduced. We also give some examples to verify our propositions and warn against naive replacement of MC with QMC under IS proposals. Numerical results suggest that using Laplace IS with $t$ distributions is more robust than that with Gaussian distributions.
This paper first strictly proved that the growth of the second moment of a large class of Gaussian processes is not greater than power function and the covariance matrix is strictly positive definite. Under these two conditions, the maximum likelihood estimators of the mean and variance of such classes of drift Gaussian process have strong consistency under broader growth of t_n. At the same time, the asymptotic normality of binary random vectors and the Berry-Ess\'{e}en bound of estimators are obtained by using the Stein's method via Malliavian calculus.
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good empirical generalization of overparameterized models. Overparameterized models are excessively complex with respect to the size of the training dataset, which results in them perfectly fitting (i.e., interpolating) the training data, which is usually noisy. Such interpolation of noisy data is traditionally associated with detrimental overfitting, and yet a wide range of interpolating models -- from simple linear models to deep neural networks -- have recently been observed to generalize extremely well on fresh test data. Indeed, the recently discovered double descent phenomenon has revealed that highly overparameterized models often improve over the best underparameterized model in test performance. Understanding learning in this overparameterized regime requires new theory and foundational empirical studies, even for the simplest case of the linear model. The underpinnings of this understanding have been laid in very recent analyses of overparameterized linear regression and related statistical learning tasks, which resulted in precise analytic characterizations of double descent. This paper provides a succinct overview of this emerging theory of overparameterized ML (henceforth abbreviated as TOPML) that explains these recent findings through a statistical signal processing perspective. We emphasize the unique aspects that define the TOPML research area as a subfield of modern ML theory and outline interesting open questions that remain.
This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
Non-convex optimization is ubiquitous in modern machine learning. Researchers devise non-convex objective functions and optimize them using off-the-shelf optimizers such as stochastic gradient descent and its variants, which leverage the local geometry and update iteratively. Even though solving non-convex functions is NP-hard in the worst case, the optimization quality in practice is often not an issue -- optimizers are largely believed to find approximate global minima. Researchers hypothesize a unified explanation for this intriguing phenomenon: most of the local minima of the practically-used objectives are approximately global minima. We rigorously formalize it for concrete instances of machine learning problems.