We analyse adversarial bandit convex optimisation with an adversary that is restricted to playing functions of the form $f_t(x) = g_t(\langle x, \theta\rangle)$ for convex $g_t : \mathbb R \to \mathbb R$ and an unknown $\theta \in \mathbb R^d$ that is fixed over time. We provide a short information-theoretic proof that the minimax regret is at most $O(d \sqrt{n} \log(n \operatorname{diam}(\mathcal K)))$, where $n$ is the number of interactions, $d$ is the dimension and $\operatorname{diam}(\mathcal K)$ is the diameter of the constraint set.
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which have been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both the underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than those provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to achieve the same generalization performance. Taken together, our results show that, up to logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression across a wide range of overparameterized problems, and, in fact, can be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.
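As a hedged illustration of the comparison at stake (not the paper's experimental setup), the sketch below contrasts one-pass, iterate-averaged SGD with a constant stepsize against ridge regression on a synthetic overparameterized least-squares instance; the dimensions, noise level, stepsize, and ridge parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 500                              # overparameterized: d > n (illustrative sizes)
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

# Ridge regression with an (assumed) regularization parameter lam.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# One-pass SGD with a constant stepsize and iterate averaging.
eta = 0.5 / np.mean(np.sum(X ** 2, axis=1))  # stepsize scaled to the data (stands in for "tuned")
w, w_avg = np.zeros(d), np.zeros(d)
for t in range(n):
    grad = (X[t] @ w - y[t]) * X[t]          # stochastic gradient of 0.5*(x_t^T w - y_t)^2
    w -= eta * grad
    w_avg += (w - w_avg) / (t + 1)           # running average of the iterates

# Compare parameter error ||w - w_*||^2 (the excess risk under an isotropic test distribution).
print("ridge excess risk:", np.sum((w_ridge - w_star) ** 2))
print("SGD   excess risk:", np.sum((w_avg - w_star) ** 2))
```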
Several decades ago, the Proximal Point Algorithm (PPA) started to attract much attention in both the abstract operator theory and the numerical optimization communities. Even in modern applications, researchers still use proximal minimization theory to design scalable algorithms that overcome nonsmoothness in high-dimensional models. Several remarkable references, such as \cite{Fer:91,Ber:82constrained,Ber:89parallel,Tom:11}, analyzed the tight local relations between the convergence rate of PPA and the regularity of the objective function. However, without taking into account the concrete computational effort required to compute each PPA iteration, any iteration-complexity estimate remains abstract and purely indicative. In this manuscript we aim to evaluate the computational complexity of practical PPA in terms of (proximal) gradient/subgradient iterations, which might allow a fair positioning of the famous PPA numerical performance within the class of first-order methods. First, we derive nonasymptotic iteration complexity estimates of exact and inexact PPA for minimizing convex functions under $\gamma$-H\"olderian growth: $\BigO{\log(1/\epsilon)}$ (for $\gamma \in [1,2]$) and $\BigO{1/\epsilon^{\gamma - 2}}$ (for $\gamma > 2$). In particular, we recover well-known results on exact PPA: finite convergence for sharp minima and linear convergence under quadratic growth, even in the presence of inexactness. Second, assuming that a standard (proximal) gradient/subgradient method is employed as a subroutine to compute the inexact PPA iteration, we show novel computational complexity bounds for a restarted variant of the inexact PPA, which is applicable when no information on the growth of the objective function is available. Numerical experiments confirm the practical performance and implementability of our schemes.
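The inner-outer structure discussed above can be sketched as follows; this is a minimal illustration assuming a smooth least-squares objective, a fixed proximal parameter, and a fixed inner gradient budget, none of which are prescribed by the manuscript.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 20))
b = rng.normal(size=50)

def f(x):                        # convex objective: 0.5 * ||Ax - b||^2
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad_f(x):
    return A.T @ (A @ x - b)

def inexact_prox(x_k, lam, inner_iters=20):
    """Approximate prox_{lam f}(x_k) = argmin_x f(x) + ||x - x_k||^2 / (2*lam)
    by a fixed number of gradient steps on the regularized subproblem."""
    L = np.linalg.norm(A, 2) ** 2 + 1.0 / lam   # Lipschitz constant of the subproblem gradient
    x = x_k.copy()
    for _ in range(inner_iters):
        x -= (1.0 / L) * (grad_f(x) + (x - x_k) / lam)
    return x

# Outer (inexact) proximal point iterations.
x = np.zeros(20)
lam = 1.0                         # assumed proximal parameter
for k in range(30):
    x = inexact_prox(x, lam)
print("final objective:", f(x))
```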
When a physical system is modeled by a nonlinear function, the unknown parameters can be estimated by fitting experimental observations through a least-squares approach. Newton's method and its variants are often used to solve problems of this type. In this paper, we are concerned with the computation of the minimal-norm solution of an underdetermined nonlinear least-squares problem. We present a Gauss-Newton type method, which relies on two relaxation parameters to ensure convergence, and which incorporates a procedure to dynamically estimate the two parameters, as well as the rank of the Jacobian matrix, during the iterations. Numerical results are presented.
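A minimal sketch of a damped Gauss-Newton iteration with minimum-norm steps via the pseudoinverse of the Jacobian; the toy residual, the fixed relaxation parameter, and the stopping rule are illustrative assumptions and do not reproduce the paper's adaptive estimation of the two parameters or of the Jacobian rank.

```python
import numpy as np

def residual(x):
    # Underdetermined toy model: 2 residuals, 3 unknowns (illustrative).
    return np.array([x[0] ** 2 + x[1] - 1.0,
                     x[1] * x[2] - 0.5])

def jacobian(x):
    return np.array([[2.0 * x[0], 1.0, 0.0],
                     [0.0, x[2], x[1]]])

def gauss_newton_minnorm(x0, alpha=1.0, tol=1e-10, max_iter=100):
    """Damped Gauss-Newton with minimum-norm steps computed by the pseudoinverse."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = residual(x)
        if np.linalg.norm(r) < tol:
            break
        J = jacobian(x)
        step = np.linalg.pinv(J) @ r      # minimum-norm solution of J s = r
        x = x - alpha * step              # alpha plays the role of a relaxation parameter
    return x

x = gauss_newton_minnorm(np.array([0.5, 0.5, 0.5]))
print("solution:", x, "residual norm:", np.linalg.norm(residual(x)))
```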
This paper presents and analyzes an immersed finite element (IFE) method for solving Stokes interface problems with a piecewise constant viscosity coefficient that has a jump across the interface. In the method, the triangulation does not need to fit the interface and the IFE spaces are constructed from the traditional $CR$-$P_0$ element with modifications near the interface according to the interface jump conditions. We prove that the IFE basis functions are unisolvent on arbitrary interface elements and the IFE spaces have the optimal approximation capabilities, although the proof is challenging due to the coupling of the velocity and the pressure. The stability and the optimal error estimates of the proposed IFE method are also derived rigorously. The constants in the error estimates are shown to be independent of the interface location relative to the triangulation. Numerical examples are provided to verify the theoretical results.
After a Hessian computation, we give a short proof of the 3D simplex mean width conjecture using classical methods. We then generalize some components of the argument to $d$ dimensions.
It is well known that bridge regression enjoys superior theoretical properties compared with the traditional LASSO. However, the current latent variable representation of its Bayesian counterpart, based on the exponential power prior, is computationally expensive in higher dimensions. In this paper, we show that the exponential power prior has a closed-form scale mixture of normals decomposition for $\alpha=(\frac{1}{2})^\gamma, \gamma \in \mathbb{N}^+$. We develop a partially collapsed Gibbs sampling scheme that outperforms existing Markov chain Monte Carlo strategies, and we also study theoretical properties under this prior when $p>n$. In addition, we introduce a non-separable bridge penalty function inspired by the fully Bayesian formulation and a novel, efficient coordinate-descent algorithm. We prove the algorithm's convergence and show that the local minimizer obtained from our optimization algorithm has an oracle property. Finally, simulation studies are carried out to illustrate the performance of the new algorithms.
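As a generic, hedged sketch of coordinate descent applied to a bridge-penalized least-squares objective (this is neither the non-separable penalty nor the specific algorithm introduced in the paper), each coordinate can be updated by a one-dimensional minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
n, p = 50, 100                          # p > n, illustrative sizes
beta_true = np.zeros(p); beta_true[:5] = 2.0
X = rng.normal(size=(n, p))
y = X @ beta_true + 0.1 * rng.normal(size=n)

alpha, lam = 0.5, 1.0                   # bridge exponent and penalty weight (assumed)

def objective(beta):
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta) ** alpha)

beta = np.zeros(p)
r = y - X @ beta                        # residual, maintained incrementally
for sweep in range(20):
    for j in range(p):
        r_j = r + X[:, j] * beta[j]     # residual with coordinate j removed
        def f_j(b):
            return 0.5 * np.sum((r_j - X[:, j] * b) ** 2) + lam * np.abs(b) ** alpha
        res = minimize_scalar(f_j, bounds=(-10.0, 10.0), method="bounded")
        b_new = res.x if f_j(res.x) < f_j(0.0) else 0.0   # also compare with an exact zero
        r = r_j - X[:, j] * b_new
        beta[j] = b_new

print("objective:", objective(beta), "nonzeros:", np.count_nonzero(beta))
```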
We consider the mathematical analysis and numerical approximation of a system of nonlinear partial differential equations that arises in models relevant to steady isochoric flows of colloidal suspensions. The symmetric velocity gradient is assumed to be a monotone nonlinear function of the deviatoric part of the Cauchy stress tensor. We prove the existence of a weak solution to the problem, and, under the additional assumption that the nonlinearity involved in the constitutive relation is Lipschitz continuous, we also prove the uniqueness of the weak solution. We then construct mixed finite element approximations of the system using both conforming and nonconforming finite element spaces. For both of these we prove the convergence of the method to the unique weak solution of the problem, and in the case of the conforming method we provide a bound on the error between the analytical solution and its finite element approximation in terms of the best approximation error from the finite element spaces. To solve the finite-dimensional problems resulting from the finite element discretisation of the system of nonlinear partial differential equations under consideration, we propose first a Lions-Mercier type iterative method and then a classical fixed-point algorithm, and we present numerical experiments that illustrate the practical performance of the proposed numerical method.
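To illustrate the flavour of the classical fixed-point approach mentioned above, here is a heavily simplified Picard iteration for a toy algebraic system with a lagged monotone nonlinearity; the actual discrete problem couples velocity, pressure, and stress through the finite element discretisation and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite "discrete" operator
b = rng.normal(size=n)

def F(s):
    # Monotone, Lipschitz nonlinearity applied componentwise (illustrative).
    return s / (1.0 + np.abs(s))

def G(x):
    # Fixed-point map: solve the linear part with the nonlinearity lagged.
    return np.linalg.solve(A, b + F(x))

x = np.zeros(n)
for k in range(100):
    x_new = G(x)
    if np.linalg.norm(x_new - x) < 1e-12:
        break
    x = x_new
print("converged after", k + 1, "iterations; fixed-point residual:",
      np.linalg.norm(x - G(x)))
```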
Despite their overwhelming capacity to overfit, deep neural networks trained by specific optimization algorithms tend to generalize well to unseen data. Recently, researchers have sought to explain this by investigating the implicit regularization effect of optimization algorithms. A remarkable advance is the work of Lyu & Li (2019), which proves that gradient descent (GD) maximizes the margin of homogeneous deep neural networks. Beyond GD, adaptive algorithms such as AdaGrad, RMSProp, and Adam are popular owing to their rapid training. However, theoretical guarantees for the generalization of adaptive optimization algorithms are still lacking. In this paper, we study the implicit regularization of adaptive optimization algorithms when they optimize the logistic loss on homogeneous deep neural networks. We prove that adaptive algorithms that adopt an exponential moving average strategy in the conditioner (such as Adam and RMSProp) can maximize the margin of the neural network, whereas AdaGrad, which directly sums historical squared gradients in the conditioner, cannot. This indicates the superiority, in terms of generalization, of the exponential moving average strategy in the design of the conditioner. Technically, we provide a unified framework to analyze the convergent direction of adaptive optimization algorithms by constructing a novel adaptive gradient flow and a surrogate margin. Our experiments support the theoretical findings on the convergent direction of adaptive optimization algorithms.
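The distinction between the two conditioner designs can be made concrete with the standard update rules (a minimal numpy sketch on a toy separable problem, not the adaptive gradient flow constructed in the paper; the data, stepsizes, and iteration counts are assumptions):

```python
import numpy as np
from scipy.special import expit

def adagrad_update(w, g, state, lr=0.1, eps=1e-8):
    # Conditioner: cumulative sum of squared gradients.
    state["G"] = state.get("G", np.zeros_like(w)) + g ** 2
    return w - lr * g / (np.sqrt(state["G"]) + eps)

def rmsprop_update(w, g, state, lr=0.1, beta=0.99, eps=1e-8):
    # Conditioner: exponential moving average of squared gradients.
    state["v"] = beta * state.get("v", np.zeros_like(w)) + (1 - beta) * g ** 2
    return w - lr * g / (np.sqrt(state["v"]) + eps)

# Toy linearly separable data with logistic loss; the normalized margin
# min_i y_i <w, x_i> / ||w|| is the quantity whose limit is studied.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0])

def grad_logistic(w):
    m = y * (X @ w)                          # per-example margins
    return -(X.T @ (y * expit(-m))) / len(y)

w_a, w_r, s_a, s_r = np.zeros(2), np.zeros(2), {}, {}
for _ in range(5000):
    w_a = adagrad_update(w_a, grad_logistic(w_a), s_a)
    w_r = rmsprop_update(w_r, grad_logistic(w_r), s_r)

for name, w in [("AdaGrad", w_a), ("RMSProp", w_r)]:
    print(name, "normalized margin:", np.min(y * (X @ w)) / np.linalg.norm(w))
```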
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
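A schematic tabular sketch of the two ingredients mentioned above, namely a per-state-action bonus added to the reward and a soft (log-sum-exp) Bellman backup that induces a Boltzmann policy; the toy MDP, the bonus form, and the fixed temperature are illustrative assumptions rather than the exact K-learning update.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, L = 4, 2, 10                               # states, actions, horizon (toy sizes)
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] is a distribution over next states
R = rng.uniform(size=(S, A))                     # mean rewards
N = np.ones((S, A))                              # pretend visit counts

tau = 1.0                                        # temperature = risk-seeking parameter (assumed fixed)
bonus = np.sqrt(1.0 / N)                         # illustrative count-based exploration bonus

# Finite-horizon backward recursion with a log-sum-exp ("soft") state value.
K = np.zeros((L + 1, S, A))
for t in reversed(range(L)):
    V_next = tau * np.log(np.sum(np.exp(K[t + 1] / tau), axis=1))   # soft value of each state
    K[t] = R + bonus + P @ V_next                                   # bonus-augmented Bellman backup

# The K-values induce a Boltzmann exploration policy with temperature tau.
policy = np.exp(K[0] / tau)
policy /= policy.sum(axis=1, keepdims=True)
print("Boltzmann policy at t = 0:\n", policy)
```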
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
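The local smoothing idea behind DRS can be illustrated with a standard Gaussian-smoothing gradient estimator (a single-machine sketch; the objective, smoothing radius, and sample count are assumptions for the example, and no communication network is modeled):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 10
c = rng.uniform(-0.9, 0.9, size=d)       # keeps the objective bounded below

def f(x):
    # Non-smooth convex objective (illustrative): ||x||_1 + <c, x>, minimized at x = 0.
    return np.sum(np.abs(x)) + c @ x

def smoothed_grad(x, gamma=0.1, m=200):
    """Monte Carlo gradient of the Gaussian smoothing f_gamma(x) = E[f(x + gamma*Z)],
    using  grad f_gamma(x) = E[Z * (f(x + gamma*Z) - f(x))] / gamma."""
    Z = rng.normal(size=(m, d))
    deltas = np.array([f(x + gamma * z) - f(x) for z in Z])
    return (Z * deltas[:, None]).mean(axis=0) / gamma

# Plain gradient descent on the smoothed surrogate of the non-smooth objective.
x = np.ones(d)
for t in range(1, 301):
    x -= (0.5 / np.sqrt(t)) * smoothed_grad(x)
print("f(x) =", f(x), "(optimal value 0 at x = 0 since |c_i| < 1)")
```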