In this paper we obtain improved iteration complexities for solving $\ell_p$ regression. We provide methods which given any full-rank $\mathbf{A} \in \mathbb{R}^{n \times d}$ with $n \geq d$, $b \in \mathbb{R}^n$, and $p \geq 2$ solve $\min_{x \in \mathbb{R}^d} \left\|\mathbf{A} x - b\right\|_p$ to high precision in time dominated by that of solving $\widetilde{O}_p(d^{\frac{p-2}{3p-2}})$ linear systems in $\mathbf{A}^\top \mathbf{D} \mathbf{A}$ for positive diagonal matrices $\mathbf{D}$. This improves upon the previous best iteration complexity of $\widetilde{O}_p(n^{\frac{p-2}{3p-2}})$ (Adil, Kyng, Peng, Sachdeva 2019). As a corollary, we obtain an $\widetilde{O}(d^{1/3}\epsilon^{-2/3})$ iteration complexity for approximate $\ell_\infty$ regression. Further, for $q \in (1, 2]$ and dual norm $q = p/(p-1)$ we provide an algorithm that solves $\ell_q$ regression in $\widetilde{O}(d^{\frac{p-2}{2p-2}})$ iterations. To obtain this result we analyze row reweightings (closely inspired by $\ell_p$-norm Lewis weights) which allow a closer connection between $\ell_2$ and $\ell_p$ regression. We provide adaptations of two different iterative optimization frameworks which leverage this connection and yield our results. The first framework is based on iterative refinement and multiplicative weights based width reduction and the second framework is based on highly smooth acceleration. Both approaches yield $\widetilde{O}_p(d^{\frac{p-2}{3p-2}})$ iteration methods but the second has a polynomial dependence on $p$ (as opposed to the exponential dependence of the first algorithm) and provides a new alternative to the previous state-of-the-art methods for $\ell_p$ regression for large $p$.
Multi-block CCA constructs linear relationships explaining coherent variations across multiple blocks of data. We view the multi-block CCA problem as finding leading generalized eigenvectors and propose to solve it via a proximal gradient descent algorithm with $\ell_1$ constraint for high dimensional data. In particular, we use a decaying sequence of constraints over proximal iterations, and show that the resulting estimate is rate-optimal under suitable assumptions. Although several previous works have demonstrated such optimality for the $\ell_0$ constrained problem using iterative approaches, the same level of theoretical understanding for the $\ell_1$ constrained formulation is still lacking. We also describe an easy-to-implement deflation procedure to estimate multiple eigenvectors sequentially. We compare our proposals to several existing methods whose implementations are available on R CRAN, and the proposed methods show competitive performances in both simulations and a real data example.
Hamilton and Moitra (2021) showed that, in certain regimes, it is not possible to accelerate Riemannian gradient descent in the hyperbolic plane if we restrict ourselves to algorithms which make queries in a (large) bounded domain and which receive gradients and function values corrupted by a (small) amount of noise. We show that acceleration remains unachievable for any deterministic algorithm which receives exact gradient and function-value information (unbounded queries, no noise). Our results hold for the classes of strongly and nonstrongly geodesically convex functions, and for a large class of Hadamard manifolds including hyperbolic spaces and the symmetric space $\mathrm{SL}(n) / \mathrm{SO}(n)$ of positive definite $n \times n$ matrices of determinant one. This cements a surprising gap between the complexity of convex optimization and geodesically convex optimization: for hyperbolic spaces, Riemannian gradient descent is optimal on the class of smooth and and strongly geodesically convex functions, in the regime where the condition number scales with the radius of the optimization domain. The key idea for proving the lower bound consists of perturbing the hard functions of Hamilton and Moitra (2021) with sums of bump functions chosen by a resisting oracle.
We present a novel method for calculating Pad\'e approximants that is capable of eliminating spurious poles placed at the point of development and of identifying and eliminating spurious poles created by precision limitations and/or noisy coefficients. Information contained in in the eliminated poles is assimilated producing a reduced order Pad\'e approximant (PA). While the [m+k/m] conformation produced by the algorithm is flexible, the m value of the rational approximant produced by the algorithm reported here is determined by the number of spurious poles eliminated. Spurious poles due to coefficient noise/precision limitations are identified using an evidence-based filter parameter applied to the singular values of a matrix comprised of the series coefficients. The rational function poles are found directly by solving a generalized eigenvalue problem defined by a matrix pencil. Spurious poles place at the point of development, responsible in some algorithms for degeneracy, are identified by their magnitudes. Residues are found by solving an overdetermined linear matrix equation. The method is compared with the so-called Robust Pad\'e Approximation (RPA) method and shown to be competitive on the problems studied. By eliminating spurious poles, particularly in functions with branch points, such as those encountered solving the power-flow problem, solution of these complex-valued problems is made more reliable.
Makespan minimization on parallel identical machines is a classical and intensively studied problem in scheduling, and a classic example for online algorithm analysis with Graham's famous list scheduling algorithm dating back to the 1960s. In this problem, jobs arrive over a list and upon an arrival, the algorithm needs to assign the job to a machine. The goal is to minimize the makespan, that is, the maximum machine load. In this paper, we consider the variant with an additional cardinality constraint: The algorithm may assign at most $k$ jobs to each machine where $k$ is part of the input. While the offline (strongly NP-hard) variant of cardinality constrained scheduling is well understood and an EPTAS exists here, no non-trivial results are known for the online variant. We fill this gap by making a comprehensive study of various different online models. First, we show that there is a constant competitive algorithm for the problem and further, present a lower bound of $2$ on the competitive ratio of any online algorithm. Motivated by the lower bound, we consider a semi-online variant where upon arrival of a job of size $p$, we are allowed to migrate jobs of total size at most a constant times $p$. This constant is called the migration factor of the algorithm. Algorithms with small migration factors are a common approach to bridge the performance of online algorithms and offline algorithms. One can obtain algorithms with a constant migration factor by rounding the size of each incoming job and then applying an ordinal algorithm to the resulting rounded instance. With this in mind, we also consider the framework of ordinal algorithms and characterize the competitive ratio that can be achieved using the aforementioned approaches.
We develop two adaptive discretization algorithms for convex semi-infinite optimization, which terminate after finitely many iterations at approximate solutions of arbitrary precision. In particular, they terminate at a feasible point of the considered optimization problem. Compared to the existing finitely feasible algorithms for general semi-infinite optimization problems, our algorithms work with considerably smaller discretizations and are thus computationally favorable. Also, our algorithms terminate at approximate solutions of arbitrary precision, while for general semi-infinite optimization problems the best possible approximate-solution precision can be arbitrarily bad. All occurring finite optimization subproblems in our algorithms have to be solved only approximately, and continuity is the only regularity assumption on our objective and constraint functions. Applications to parametric and non-parametric regression problems under shape constraints are discussed.
We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent, and $\mathcal{O}(1/t)$ for normalized gradient descent. This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual, which manages to result in a simple and intuitive method in the primal. This dual view can also be used to derive a stochastic variant, which performs adaptive non-uniform sampling via the dual variables.
Influence maximization is the task of selecting a small number of seed nodes in a social network to maximize the spread of the influence from these seeds, and it has been widely investigated in the past two decades. In the canonical setting, the whole social network as well as its diffusion parameters is given as input. In this paper, we consider the more realistic sampling setting where the network is unknown and we only have a set of passively observed cascades that record the set of activated nodes at each diffusion step. We study the task of influence maximization from these cascade samples (IMS), and present constant approximation algorithms for this task under mild conditions on the seed set distribution. To achieve the optimization goal, we also provide a novel solution to the network inference problem, that is, learning diffusion parameters and the network structure from the cascade data. Comparing with prior solutions, our network inference algorithm requires weaker assumptions and does not rely on maximum-likelihood estimation and convex programming. Our IMS algorithms enhance the learning-and-then-optimization approach by allowing a constant approximation ratio even when the diffusion parameters are hard to learn, and we do not need any assumption related to the network structure or diffusion parameters.
Interpretation of Deep Neural Networks (DNNs) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by reformulating the training procedure from the trajectory optimization perspective. We first show that most widely-used algorithms for training DNNs can be linked to the Differential Dynamic Programming (DDP), a celebrated second-order trajectory optimization algorithm rooted in the Approximate Dynamic Programming. In this vein, we propose a new variant of DDP that can accept batch optimization for training feedforward networks, while integrating naturally with the recent progress in curvature approximation. The resulting algorithm features layer-wise feedback policies which improve convergence rate and reduce sensitivity to hyper-parameter over existing methods. We show that the algorithm is competitive against state-ofthe-art first and second order methods. Our work opens up new avenues for principled algorithmic design built upon the optimal control theory.
In order to avoid the curse of dimensionality, frequently encountered in Big Data analysis, there was a vast development in the field of linear and nonlinear dimension reduction techniques in recent years. These techniques (sometimes referred to as manifold learning) assume that the scattered input data is lying on a lower dimensional manifold, thus the high dimensionality problem can be overcome by learning the lower dimensionality behavior. However, in real life applications, data is often very noisy. In this work, we propose a method to approximate $\mathcal{M}$ a $d$-dimensional $C^{m+1}$ smooth submanifold of $\mathbb{R}^n$ ($d \ll n$) based upon noisy scattered data points (i.e., a data cloud). We assume that the data points are located "near" the lower dimensional manifold and suggest a non-linear moving least-squares projection on an approximating $d$-dimensional manifold. Under some mild assumptions, the resulting approximant is shown to be infinitely smooth and of high approximation order (i.e., $O(h^{m+1})$, where $h$ is the fill distance and $m$ is the degree of the local polynomial approximation). The method presented here assumes no analytic knowledge of the approximated manifold and the approximation algorithm is linear in the large dimension $n$. Furthermore, the approximating manifold can serve as a framework to perform operations directly on the high dimensional data in a computationally efficient manner. This way, the preparatory step of dimension reduction, which induces distortions to the data, can be avoided altogether.
The Normalized Cut (NCut) objective function, widely used in data clustering and image segmentation, quantifies the cost of graph partitioning in a way that biases clusters or segments that are balanced towards having lower values than unbalanced partitionings. However, this bias is so strong that it avoids any singleton partitions, even when vertices are very weakly connected to the rest of the graph. Motivated by the B\"uhler-Hein family of balanced cut costs, we propose the family of Compassionately Conservative Balanced (CCB) Cut costs, which are indexed by a parameter that can be used to strike a compromise between the desire to avoid too many singleton partitions and the notion that all partitions should be balanced. We show that CCB-Cut minimization can be relaxed into an orthogonally constrained $\ell_{\tau}$-minimization problem that coincides with the problem of computing Piecewise Flat Embeddings (PFE) for one particular index value, and we present an algorithm for solving the relaxed problem by iteratively minimizing a sequence of reweighted Rayleigh quotients (IRRQ). Using images from the BSDS500 database, we show that image segmentation based on CCB-Cut minimization provides better accuracy with respect to ground truth and greater variability in region size than NCut-based image segmentation.