We revisit the $k$-Hessian eigenvalue problem on a smooth, bounded, $(k-1)$-convex domain in $\mathbb R^n$. First, we obtain a spectral characterization of the $k$-Hessian eigenvalue as the infimum of the first eigenvalues of linear second-order elliptic operators whose coefficients belong to the dual of the corresponding G\r{a}rding cone. Second, we introduce a non-degenerate inverse iterative scheme to solve the eigenvalue problem for the $k$-Hessian operator. We show that the scheme converges, with a rate, to the $k$-Hessian eigenvalue for all $k$. When $2\leq k\leq n$, we also prove a local $L^1$ convergence of the Hessian of solutions of the scheme. Hyperbolic polynomials play an important role in our analysis.
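For orientation, the eigenvalue problem referred to above is usually posed as follows; the exact normalization and admissibility class may differ from those used in the paper:
\[
S_k(D^2 u) = \lambda\,(-u)^k \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega,
\]
where $S_k(D^2 u)$ denotes the $k$-th elementary symmetric polynomial of the eigenvalues of the Hessian $D^2 u$ and $u$ is a negative $k$-admissible function.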
Difference-of-Convex (DC) minimization, referring to the problem of minimizing the difference of two convex functions, has found rich applications in statistical learning and has been studied extensively for decades. However, existing methods are primarily based on multi-stage convex relaxation, which only leads to the weak optimality of critical points. This paper proposes a coordinate descent method for minimizing DC functions based on sequential nonconvex approximation. Our approach iteratively solves a nonconvex one-dimensional subproblem globally, and it is guaranteed to converge to a coordinate-wise stationary point. We prove that this new optimality condition is always stronger than the critical point condition and the directional point condition when the objective function is weakly convex. For comparison, we also include a naive variant of coordinate descent methods based on sequential convex approximation in our study. When the objective function satisfies an additional regularity condition called \emph{sharpness}, coordinate descent methods with an appropriate initialization converge \emph{linearly} to the optimal solution set. Also, for many applications of interest, we show that the nonconvex one-dimensional subproblem can be computed exactly and efficiently using a breakpoint searching method. We present some discussions and extensions of our proposed method. Finally, we conduct extensive experiments on several statistical learning tasks to show the superiority of our approach. Keywords: Coordinate Descent, DC Minimization, DC Programming, Difference-of-Convex Programs, Nonconvex Optimization, Sparse Optimization, Binary Optimization.
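As a rough illustration of the style of iteration described above (not the paper's algorithm: the toy $\ell_1$-minus-$\ell_2$ objective and the grid-based one-dimensional search below stand in for the exact breakpoint procedure), a minimal sketch in Python:
\begin{verbatim}
import numpy as np

def coordinate_descent_dc(F, x0, grid_radius=5.0, grid_size=401, n_epochs=20):
    """Cyclic coordinate descent that approximately minimizes the
    one-dimensional restriction of F along each coordinate by a dense
    grid search -- a crude stand-in for an exact breakpoint search."""
    x = x0.copy()
    for _ in range(n_epochs):
        for i in range(x.size):
            candidates = x[i] + np.linspace(-grid_radius, grid_radius, grid_size)
            values = []
            for c in candidates:
                y = x.copy()
                y[i] = c
                values.append(F(y))
            x[i] = candidates[int(np.argmin(values))]
    return x

# Toy DC objective: least squares plus an (l1 - l2) sparsity penalty,
# i.e. a difference of two convex functions.
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 10)), rng.standard_normal(20), 0.5
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) \
    + lam * (np.sum(np.abs(x)) - np.linalg.norm(x))
x_hat = coordinate_descent_dc(F, np.zeros(10))
\end{verbatim}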
Counting the independent sets of a graph is a classical #P-complete problem, even in the bipartite case. We give an exponential-time approximation scheme for this problem which is faster than the best known algorithm for the exact problem. The running time of our algorithm on general graphs with error tolerance $\varepsilon$ is at most $O(2^{0.2680n})$ times a polynomial in $1/\varepsilon$. On bipartite graphs, the exponential term in the running time is improved to $O(2^{0.2372n})$. Our methods combine techniques from exact exponential algorithms with techniques from approximate counting. Along the way, we generalise (to the multivariate case) the FPTAS of Sinclair, Srivastava, \v{S}tefankovi\v{c} and Yin for approximating the hard-core partition function on graphs with bounded connective constant. We also obtain an FPTAS for counting independent sets on graphs with no vertex of degree at least 6 whose neighbours' degrees sum to 27 or more. By a result of Sly, there is no FPTAS that applies to all graphs with maximum degree 6 unless $\mbox{P}=\mbox{NP}$.
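For contrast with the approximation scheme above, the exact problem can be solved by classical exponential-time branching on an arbitrary remaining vertex; a minimal sketch (an illustration of the exact baseline only, not the method of the paper):
\begin{verbatim}
def count_independent_sets(adj):
    """Count the independent sets of a graph given as {vertex: set_of_neighbours}.
    Classic exponential-time branching on an arbitrary remaining vertex."""
    if not adj:
        return 1
    v = next(iter(adj))
    # Branch 1: v is excluded; simply delete v.
    without_v = {u: nbrs - {v} for u, nbrs in adj.items() if u != v}
    # Branch 2: v is included; delete v together with its whole neighbourhood.
    removed = adj[v] | {v}
    with_v = {u: nbrs - removed for u, nbrs in adj.items() if u not in removed}
    return count_independent_sets(without_v) + count_independent_sets(with_v)

# A path on 4 vertices has 8 independent sets (including the empty set).
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
assert count_independent_sets(path4) == 8
\end{verbatim}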
The tube method, or the volume-of-tube method, approximates the tail probability of the maximum of a smooth Gaussian random field with zero mean and unit variance. This method evaluates the volume of a spherical tube about the index set and then transforms it into the tail probability. In this study, we generalize the tube method to the case in which the variance is not constant. We provide the volume formula for a spherical tube with a non-constant radius in terms of curvature tensors, and the tail probability formula for the maximum of a Gaussian random field with inhomogeneous variance, together with its Laplace approximation. In particular, the critical radius of the tube is generalized for evaluating the asymptotic approximation error. As an example, we discuss the approximation of the largest eigenvalue distribution of the Wishart matrix with a non-identity matrix parameter. The Bonferroni method is the tube method when the index set is a finite set. We provide the formula for the asymptotic approximation error of the Bonferroni method when the variance is not constant.
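As a toy numerical illustration of the finite-index (Bonferroni) case with non-constant variance, one can compare the union bound with an empirical tail probability; the covariance structure and threshold below are invented for illustration:
\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, c = 5, 2.5                                   # finite index set of size m, threshold c
sigma = np.array([0.6, 0.8, 1.0, 0.9, 0.7])     # non-constant standard deviations
corr = 0.3 * np.ones((m, m)) + 0.7 * np.eye(m)  # equicorrelated field on the index set
cov = corr * np.outer(sigma, sigma)

samples = rng.multivariate_normal(np.zeros(m), cov, size=100_000)
empirical = np.mean(samples.max(axis=1) > c)
bonferroni = np.sum(norm.sf(c / sigma))         # union bound with per-index variances
print("empirical tail:", empirical, "Bonferroni bound:", bonferroni)
\end{verbatim}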
In this paper, we consider the density estimation problem associated with the stationary measure of ergodic It\^o diffusions from a discrete-time series that approximates the solutions of the stochastic differential equations. To take advantage of the characterization of the density function through the stationary solution of a parabolic-type Fokker-Planck PDE, we proceed as follows. First, we employ deep neural networks to approximate the drift and diffusion terms of the SDE by solving appropriate supervised learning tasks. Subsequently, we solve a steady-state Fokker-Planck equation associated with the estimated drift and diffusion coefficients with a neural-network-based least-squares method. We establish the convergence of the proposed scheme under appropriate mathematical assumptions, accounting for the generalization errors induced by regressing the drift and diffusion coefficients and by the PDE solvers. This theoretical study relies on a recent perturbation theory result for Markov chains that shows a linear dependence of the density estimation error on the error in estimating the drift term, and on generalization error results for nonparametric regression and for PDE solutions obtained with neural-network models. The effectiveness of this method is demonstrated by numerical simulations of a two-dimensional Student's t distribution and a 20-dimensional Langevin dynamics.
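For reference, the steady-state Fokker-Planck equation solved in the second step is typically of the form (the paper's exact formulation, boundary conditions, and normalization may differ):
\[
0 \;=\; -\sum_{i=1}^{d}\partial_{x_i}\!\big(b_i(x)\,p(x)\big) \;+\; \frac12\sum_{i,j=1}^{d}\partial_{x_i}\partial_{x_j}\!\big((\sigma\sigma^{\top})_{ij}(x)\,p(x)\big),
\qquad \int_{\mathbb{R}^d} p(x)\,dx = 1,
\]
where $b$ and $\sigma$ are the estimated drift and diffusion coefficients and $p$ is the stationary density.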
We consider a sparse deep ReLU network (SDRN) estimator obtained from empirical risk minimization with a Lipschitz loss function in the presence of a large number of features. Our framework can be applied to a variety of regression and classification problems. The unknown target function to be estimated is assumed to lie in a Korobov space. Functions in this space only need to satisfy a smoothness condition rather than having a compositional structure. We develop non-asymptotic excess risk bounds for our SDRN estimator. We further show that the SDRN estimator can achieve the same minimax rate of estimation (up to logarithmic factors) as one-dimensional nonparametric regression when the dimension of the features is fixed, and that the estimator has a suboptimal rate when the dimension grows with the sample size. We show that the depth and the total number of nodes and weights of the ReLU network need to grow as the sample size increases to ensure good performance, and we investigate how fast they should increase with the sample size. These results provide important theoretical guidance and a basis for empirical studies with deep neural networks.
We consider the approximation rates of shallow neural networks with respect to the variation norm. Upper bounds on these rates have been established for sigmoidal and ReLU activation functions, but it has remained an important open problem whether these rates are sharp. In this article, we provide a solution to this problem by proving sharp lower bounds on the approximation rates for shallow neural networks, which are obtained by lower bounding the $L^2$-metric entropy of the convex hull of the neural network basis functions. In addition, our methods also give sharp lower bounds on the Kolmogorov $n$-widths of this convex hull, which show that the variation spaces corresponding to shallow neural networks cannot be efficiently approximated by linear methods. These lower bounds apply to both sigmoidal activation functions with bounded variation and to activation functions which are a power of the ReLU. Our results also quantify how much stronger the Barron spectral norm is than the variation norm and, combined with previous results, give the asymptotics of the $L^\infty$-metric entropy up to logarithmic factors in the case of the ReLU activation function.
We introduce an error-resilient distributed computing method based on an extension of the channel polarization phenomenon to distributed algorithms. The method leverages an algorithmic split operation that transforms two identical compute nodes into slow and fast workers, which parallels the channel split operation in Polar Codes. This operation preserves the average runtime, analogous to the conservation of Shannon capacity in channel polarization. By leveraging a recursive construction in a similar spirit to the Fast Fourier Transform, this method synthesizes virtual compute nodes with dispersed return time distributions, an effect we call computational polarization. We show that the runtime distributions form a functional martingale process, identify their limiting distributions in closed form together with non-asymptotic convergence rates, and prove strong convergence results in Banach spaces. We provide an information-theoretic lower bound on the overall runtime of any coded computation method and show that the computational polarization approach asymptotically achieves the optimal runtime for computing linear functions. An important advantage is the near-linear-time decoding procedure, which is significantly cheaper than that of Maximum Distance Separable codes.
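One toy way to picture the split operation (a guess at the flavour of the construction, not the paper's actual scheme): pairing two i.i.d. runtimes into their minimum and maximum preserves the average while producing one faster and one slower virtual node:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
t1 = rng.exponential(scale=1.0, size=100_000)  # runtimes of two identical workers
t2 = rng.exponential(scale=1.0, size=100_000)

fast, slow = np.minimum(t1, t2), np.maximum(t1, t2)  # pair into two virtual nodes
# The total (hence average) runtime is conserved, while the two virtual
# runtime distributions are pushed apart -- a crude analogue of polarization.
print(np.mean(t1 + t2), np.mean(fast + slow))  # equal by construction
print(np.mean(fast), np.mean(slow))            # roughly 0.5 and 1.5 for Exp(1)
\end{verbatim}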
In this paper, a high-order and fast numerical method is investigated for the time-fractional Black-Scholes equation. In order to deal with the typical weak initial singularities of the solution, we construct a finite difference scheme with variable time steps, where the fractional derivative is approximated by the nonuniform Alikhanov formula combined with the sum-of-exponentials (SOE) technique. In the spatial direction, an average approximation with fourth-order accuracy is employed. The stability and the convergence of the proposed scheme, of second order in time and fourth order in space, are rigorously derived by the energy method. Numerical examples are given to confirm the theoretical results.
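For reference, the time-fractional derivative that Alikhanov-type formulas discretize is usually the Caputo derivative of order $\alpha\in(0,1)$ (the paper's exact setting may differ):
\[
{}^{C}\!D_t^{\alpha}u(x,t) \;=\; \frac{1}{\Gamma(1-\alpha)}\int_0^{t}\frac{\partial u(x,s)}{\partial s}\,(t-s)^{-\alpha}\,ds,
\qquad 0<\alpha<1.
\]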
Most existing statistical theories on deep neural networks have sample complexities cursed by the data dimension and therefore cannot well explain the empirical success of deep learning on high-dimensional data. To bridge this gap, we propose to exploit the low-dimensional geometric structures of real-world data sets. We establish theoretical guarantees of convolutional residual networks (ConvResNets) in terms of function approximation and statistical estimation for binary classification. Specifically, given data lying on a $d$-dimensional manifold isometrically embedded in $\mathbb{R}^D$, we prove that if the network architecture is properly chosen, ConvResNets can (1) approximate Besov functions on manifolds with arbitrary accuracy, and (2) learn a classifier by minimizing the empirical logistic risk, achieving an excess risk of order $n^{-\frac{s}{2s+2(s\vee d)}}$, where $s$ is a smoothness parameter. This implies that the sample complexity depends on the intrinsic dimension $d$ rather than the data dimension $D$. Our results demonstrate that ConvResNets are adaptive to low-dimensional structures of data sets.
Many problems in signal processing reduce to nonparametric function estimation. We propose a new methodology, piecewise convex fitting (PCF), and give a two-stage adaptive estimate. In the first stage, the number and locations of the change points are estimated using strong smoothing. In the second stage, a constrained smoothing spline fit is performed with the smoothing level chosen to minimize the MSE. The imposed constraint is that a single change point occurs in a region about each empirical change point of the first-stage estimate. This constraint is equivalent to requiring that the third derivative of the second-stage estimate have a single sign in a small neighborhood about each first-stage change point. We sketch how PCF may be applied to signal recovery, instantaneous frequency estimation, surface reconstruction, image segmentation, spectral estimation, and multivariate adaptive regression.
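A rough sketch of the flavour of the first stage (heavy smoothing followed by locating sign changes of a higher derivative); the spline choice, smoothing level, and change-point rule below are placeholders rather than the authors' procedure:
\begin{verbatim}
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 400)
y = np.sin(6 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # toy noisy signal

# Stage 1 (illustrative only): a strongly smoothed quintic spline fit, with
# candidate change points placed where its third derivative changes sign.
smooth = UnivariateSpline(x, y, k=5, s=5.0)
d3 = smooth.derivative(3)(x)
change_points = x[1:][np.sign(d3[1:]) != np.sign(d3[:-1])]
print(change_points)
\end{verbatim}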