The subgradient method is one of the most fundamental algorithmic schemes for nonsmooth optimization. The existing complexity and convergence results for this algorithm are mainly derived for Lipschitz continuous objective functions. In this work, we first extend the typical complexity results for the subgradient method to convex and weakly convex minimization without assuming Lipschitz continuity. Specifically, we establish $\mathcal{O}(1/\sqrt{T})$ bound in terms of the suboptimality gap ``$f(x) - f^*$'' for convex case and $\mathcal{O}(1/{T}^{1/4})$ bound in terms of the gradient of the Moreau envelope function for weakly convex case. Furthermore, we provide convergence results for non-Lipschitz convex and weakly convex objective functions using proper diminishing rules on the step sizes. In particular, when $f$ is convex, we show $\mathcal{O}(\log(k)/\sqrt{k})$ rate of convergence in terms of the suboptimality gap. With an additional quadratic growth condition, the rate is improved to $\mathcal{O}(1/k)$ in terms of the squared distance to the optimal solution set. When $f$ is weakly convex, asymptotic convergence is derived. The central idea is that the dynamics of properly chosen step sizes rule fully controls the movement of the subgradient method, which leads to boundedness of the iterates, and then a trajectory-based analysis can be conducted to establish the desired results. To further illustrate the wide applicability of our framework, we extend the complexity results to the truncated subgradient, the stochastic subgradient, the incremental subgradient, and the proximal subgradient methods for non-Lipschitz functions.
The strong convergence of numerical methods for stochastic differential equations (SDEs) for $t\in[0,\infty)$ is proved. The result is applicable to any one-step numerical methods with Markov property that have the finite time strong convergence and the uniformly bounded moment. In addition, the convergence of the numerical stationary distribution to the underlying one can be derived from this result. To demonstrate the application of this result, the strong convergence in the infinite horizon of the backward Euler-Maruyama method in the $L^p$ sense for some small $p\in (0,1)$ is proved for SDEs with super-linear coefficients, which is also a a standalone new result. Numerical simulations are provided to illustrate the theoretical results.
We study the problem of estimating the convex hull of the image $f(X)\subset\mathbb{R}^n$ of a compact set $X\subset\mathbb{R}^m$ with smooth boundary through a smooth function $f:\mathbb{R}^m\to\mathbb{R}^n$. Assuming that $f$ is a submersion, we derive a new bound on the Hausdorff distance between the convex hull of $f(X)$ and the convex hull of the images $f(x_i)$ of $M$ sampled inputs $x_i$ on the boundary of $X$. When applied to the problem of geometric inference from a random sample, our results give tighter and more general error bounds than the state of the art. We present applications to the problems of robust optimization, of reachability analysis of dynamical systems, and of robust trajectory optimization under bounded uncertainty.
In this paper, we provide a novel framework for the analysis of generalization error of first-order optimization algorithms for statistical learning when the gradient can only be accessed through partial observations given by an oracle. Our analysis relies on the regularity of the gradient w.r.t. the data samples, and allows to derive near matching upper and lower bounds for the generalization error of multiple learning problems, including supervised learning, transfer learning, robust learning, distributed learning and communication efficient learning using gradient quantization. These results hold for smooth and strongly-convex optimization problems, as well as smooth non-convex optimization problems verifying a Polyak-Lojasiewicz assumption. In particular, our upper and lower bounds depend on a novel quantity that extends the notion of conditional standard deviation, and is a measure of the extent to which the gradient can be approximated by having access to the oracle. As a consequence, our analysis provides a precise meaning to the intuition that optimization of the statistical learning objective is as hard as the estimation of its gradient. Finally, we show that, in the case of standard supervised learning, mini-batch gradient descent with increasing batch sizes and a warm start can reach a generalization error that is optimal up to a multiplicative factor, thus motivating the use of this optimization scheme in practical applications.
A graph $G$ is \emph{locally irregular} if no two of its adjacent vertices have the same degree. In [Fioravantes et al. Complexity of finding maximum locally irregular induced subgraph. {\it SWAT}, 2022], the authors introduced and studied the problem of finding a locally irregular induced subgraph of a given a graph $G$ of maximum order, or, equivalently, computing a subset $S$ of $V(G)$ of minimum order, whose deletion from $G$ results in a locally irregular graph; $S$ is denoted as an \emph{optimal vertex-irregulator of $G$}. In this work we provide an in-depth analysis of the parameterised complexity of computing an optimal vertex-irregulator of a given graph $G$. Moreover, we introduce and study a variation of this problem, where $S$ is a substet of the edges of $G$; in this case, $S$ is denoted as an \emph{optimal edge-irregulator of $G$}. In particular, we prove that computing an optimal vertex-irregulator of a graph $G$ is in FPT when parameterised by the vertex integrity, neighborhood diversity or cluster deletion number of $G$, while it is $W[1]$-hard when parameterised by the feedback vertex set number or the treedepth of $G$. In the case of computing an optimal edge-irregulator of a graph $G$, we prove that this problem is in FPT when parameterised by the vertex integrity of $G$, while it is NP-hard even if $G$ is a planar bipartite graph of maximum degree $4$, and $W[1]$-hard when parameterised by the size of the solution, the feedback vertex set or the treedepth of $G$. Our results paint a comprehensive picture of the tractability of both problems studied here, considering most of the standard graph-structural parameters.
It is well known that the Euler method for approximating the solutions of a random ordinary differential equation $\mathrm{d}X_t/\mathrm{d}t = f(t, X_t, Y_t)$ driven by a stochastic process $\{Y_t\}_t$ with $\theta$-H\"older sample paths is estimated to be of strong order $\theta$ with respect to the time step, provided $f=f(t, x, y)$ is sufficiently regular and with suitable bounds. Here, it is proved that, in many typical cases, further conditions on the noise can be exploited so that the strong convergence is actually of order 1, regardless of the H\"older regularity of the sample paths. This applies for instance to additive or multiplicative It\^o process noises (such as Wiener, Ornstein-Uhlenbeck, and geometric Brownian motion processes); to point-process noises (such as Poisson point processes and Hawkes self-exciting processes, which even have jump-type discontinuities); and to transport-type processes with sample paths of bounded variation. The result is based on a novel approach, estimating the global error as an iterated integral over both large and small mesh scales, and switching the order of integration to move the critical regularity to the large scale. The work is complemented with numerical simulations illustrating the strong order 1 convergence in those cases, and with an example with fractional Brownian motion noise with Hurst parameter $0 < H < 1/2$ for which the order of convergence is $H + 1/2$, hence lower than the attained order 1 in the examples above, but still higher than the order $H$ of convergence expected from previous works.
We provide a rigorous analysis of training by variational inference (VI) of Bayesian neural networks in the two-layer and infinite-width case. We consider a regression problem with a regularized evidence lower bound (ELBO) which is decomposed into the expected log-likelihood of the data and the Kullback-Leibler (KL) divergence between the a priori distribution and the variational posterior. With an appropriate weighting of the KL, we prove a law of large numbers for three different training schemes: (i) the idealized case with exact estimation of a multiple Gaussian integral from the reparametrization trick, (ii) a minibatch scheme using Monte Carlo sampling, commonly known as Bayes by Backprop, and (iii) a new and computationally cheaper algorithm which we introduce as Minimal VI. An important result is that all methods converge to the same mean-field limit. Finally, we illustrate our results numerically and discuss the need for the derivation of a central limit theorem.
We consider additive Schwarz methods for boundary value problems involving the $p$-Laplacian. Although the existing theoretical estimates indicate a sublinear convergence rate for these methods, empirical evidence from numerical experiments demonstrates a linear convergence rate. In this paper, we narrow the gap between these theoretical and empirical results by presenting a novel convergence analysis. Firstly, we present an abstract convergence theory of additive Schwarz methods written in terms of a quasi-norm. This quasi-norm exhibits behavior similar to the Bregman distance of the convex energy functional associated to the problem. Secondly, we provide a quasi-norm version of the Poincar'{e}--Friedrichs inequality, which is essential for deriving a quasi-norm stable decomposition for a two-level domain decomposition setting. By utilizing these two key elements, we establish a new bound for the linear convergence rate of the methods.
We show the sup-norm convergence of deep neural network estimators with a novel adversarial training scheme. For the nonparametric regression problem, it has been shown that an estimator using deep neural networks can achieve better performances in the sense of the $L2$-norm. In contrast, it is difficult for the neural estimator with least-squares to achieve the sup-norm convergence, due to the deep structure of neural network models. In this study, we develop an adversarial training scheme and investigate the sup-norm convergence of deep neural network estimators. First, we find that ordinary adversarial training makes neural estimators inconsistent. Second, we show that a deep neural network estimator achieves the optimal rate in the sup-norm sense by the proposed adversarial training with correction. We extend our adversarial training to general setups of a loss function and a data-generating function. Our experiments support the theoretical findings.
In this paper, we consider a new approach for semi-discretization in time and spatial discretization of a class of semi-linear stochastic partial differential equations (SPDEs) with multiplicative noise. The drift term of the SPDEs is only assumed to satisfy a one-sided Lipschitz condition and the diffusion term is assumed to be globally Lipschitz continuous. Our new strategy for time discretization is based on the Milstein method from stochastic differential equations. We use the energy method for its error analysis and show a strong convergence order of nearly $1$ for the approximate solution. The proof is based on new H\"older continuity estimates of the SPDE solution and the nonlinear term. For the general polynomial-type drift term, there are difficulties in deriving even the stability of the numerical solutions. We propose an interpolation-based finite element method for spatial discretization to overcome the difficulties. Then we obtain $H^1$ stability, higher moment $H^1$ stability, $L^2$ stability, and higher moment $L^2$ stability results using numerical and stochastic techniques. The nearly optimal convergence orders in time and space are hence obtained by coupling all previous results. Numerical experiments are presented to implement the proposed numerical scheme and to validate the theoretical results.
We present an information-theoretic approach to lower bound the oracle complexity of nonsmooth black box convex optimization, unifying previous lower bounding techniques by identifying a combinatorial problem, namely string guessing, as a single source of hardness. As a measure of complexity we use distributional oracle complexity, which subsumes randomized oracle complexity as well as worst-case oracle complexity. We obtain strong lower bounds on distributional oracle complexity for the box $[-1,1]^n$, as well as for the $L^p$-ball for $p \geq 1$ (for both low-scale and large-scale regimes), matching worst-case upper bounds, and hence we close the gap between distributional complexity, and in particular, randomized complexity, and worst-case complexity. Furthermore, the bounds remain essentially the same for high-probability and bounded-error oracle complexity, and even for combination of the two, i.e., bounded-error high-probability oracle complexity. This considerably extends the applicability of known bounds.