Many existing algorithms for streaming geometric data analysis have been plagued by exponential dependencies in the space complexity, which are undesirable for processing high-dimensional data sets. In particular, once $d\geq\log n$, there are no known non-trivial streaming algorithms for problems such as maintaining convex hulls and L\"owner-John ellipsoids of $n$ points, despite a long line of work in streaming computational geometry since [AHV04]. We simultaneously improve these results to $\mathrm{poly}(d,\log n)$ bits of space by trading off with a $\mathrm{poly}(d,\log n)$ factor distortion. We achieve these results in a unified manner, by designing the first streaming algorithm for maintaining a coreset for $\ell_\infty$ subspace embeddings with $\mathrm{poly}(d,\log n)$ space and $\mathrm{poly}(d,\log n)$ distortion. Our algorithm also gives similar guarantees in the \emph{online coreset} model. Along the way, we sharpen results for online numerical linear algebra by replacing a log condition number dependence with a $\log n$ dependence, answering a question of [BDM+20]. Our techniques provide a novel connection between leverage scores, a fundamental object in numerical linear algebra, and computational geometry. For $\ell_p$ subspace embeddings, we give nearly optimal trade-offs between space and distortion for one-pass streaming algorithms. For instance, we give a deterministic coreset using $O(d^2\log n)$ space and $O((d\log n)^{1/2-1/p})$ distortion for $p>2$, whereas previous deterministic algorithms incurred a $\mathrm{poly}(n)$ factor in the space or the distortion [CDW18]. Our techniques have implications in the offline setting, where we give optimal trade-offs between the space complexity and distortion of subspace sketch data structures. To do this, we give an elementary proof of a "change of density" theorem of [LT80] and make it algorithmic.
We study the problem of reconstructing solutions of inverse problems with neural networks when only noisy data is available. We assume that the problem can be modeled with an infinite-dimensional forward operator that is not continuously invertible. Then, we restrict this forward operator to finite-dimensional spaces so that the inverse is Lipschitz continuous. For the inverse operator, we demonstrate that there exists a neural network which is a robust-to-noise approximation of the operator. In addition, we show that these neural networks can be learned from appropriately perturbed training data. We demonstrate the admissibility of this approach to a wide range of inverse problems of practical interest. Numerical examples are given that support the theoretical findings.
In supervised learning using kernel methods, we often encounter a large-scale finite-sum minimization over a reproducing kernel Hilbert space (RKHS). Large-scale finite-sum problems can be solved using efficient variants of Newton method, where the Hessian is approximated via sub-samples of data. In RKHS, however, the dependence of the penalty function to kernel makes standard sub-sampling approaches inapplicable, since the gram matrix is not readily available in a low-rank form. In this paper, we observe that for this class of problems, one can naturally use kernel approximation to speed up the Newton method. Focusing on randomized features for kernel approximation, we provide a novel second-order algorithm that enjoys local superlinear convergence and global linear convergence (with high probability). We derive the theoretical lower bound for the number of random features required for the approximated Hessian to be close to the true Hessian in the norm sense. Our numerical experiments on real-world data verify the efficiency of our method compared to several benchmarks.
We show that many graphs with bounded treewidth can be described as subgraphs of the strong product of a graph with smaller treewidth and a bounded-size complete graph. To this end, define the "underlying treewidth" of a graph class $\mathcal{G}$ to be the minimum non-negative integer $c$ such that, for some function $f$, for every graph ${G \in \mathcal{G}}$ there is a graph $H$ with ${\text{tw}(H) \leq c}$ such that $G$ is isomorphic to a subgraph of ${H \boxtimes K_{f(\text{tw}(G))}}$. We introduce disjointed coverings of graphs and show they determine the underlying treewidth of any graph class. Using this result, we prove that the class of planar graphs has underlying treewidth 3; the class of $K_{s,t}$-minor-free graphs has underlying treewidth $s$ (for ${t \geq \max\{s,3\}}$); and the class of $K_t$-minor-free graphs has underlying treewidth ${t-2}$. In general, we prove that a monotone class has bounded underlying treewidth if and only if it excludes some fixed topological minor. We also study the underlying treewidth of graph classes defined by an excluded subgraph or excluded induced subgraph. We show that the class of graphs with no $H$ subgraph has bounded underlying treewidth if and only if every component of $H$ is a subdivided star, and that the class of graphs with no induced $H$ subgraph has bounded underlying treewidth if and only if every component of $H$ is a star.
Analyzing time series in the frequency domain enables the development of powerful tools for investigating the second-order characteristics of multivariate stochastic processes. Parameters like the spectral density matrix and its inverse, the coherence or the partial coherence, encode comprehensively the complex linear relations between the component processes of the multivariate system. In this paper, we develop inference procedures for such parameters in a high-dimensional, time series setup. In particular, we first focus on the derivation of consistent estimators of the coherence and, more importantly, of the partial coherence which possess manageable limiting distributions that are suitable for testing purposes. Statistical tests of the hypothesis that the maximum over frequencies of the coherence, respectively, of the partial coherence, do not exceed a prespecified threshold value are developed. Our approach allows for testing hypotheses for individual coherences and/or partial coherences as well as for multiple testing of large sets of such parameters. In the latter case, a consistent procedure to control the false discovery rate is developed. The finite sample performance of the inference procedures proposed is investigated by means of simulations and applications to the construction of graphical interaction models for brain connectivity based on EEG data are presented.
We study the log-rank conjecture from the perspective of point-hyperplane incidence geometry. We formulate the following conjecture: Given a point set in $\mathbb{R}^d$ that is covered by constant-sized sets of parallel hyperplanes, there exists an affine subspace that accounts for a large (i.e., $2^{-{\operatorname{polylog}(d)}}$) fraction of the incidences. Alternatively, our conjecture may be interpreted linear-algebraically as follows: Any rank-$d$ matrix containing at most $O(1)$ distinct entries in each column contains a submatrix of fractional size $2^{-{\operatorname{polylog}(d)}}$, in which each column contains one distinct entry. We prove that our conjecture is equivalent to the log-rank conjecture. Motivated by the connections above, we revisit well-studied questions in point-hyperplane incidence geometry without structural assumptions (i.e., the existence of partitions). We give an elementary argument for the existence of complete bipartite subgraphs of density $\Omega(\epsilon^{2d}/d)$ in any $d$-dimensional configuration with incidence density $\epsilon$. We also improve an upper-bound construction of Apfelbaum and Sharir (SIAM J. Discrete Math. '07), yielding a configuration whose complete bipartite subgraphs are exponentially small and whose incidence density is $\Omega(1/\sqrt d)$. Finally, we discuss various constructions (due to others) which yield configurations with incidence density $\Omega(1)$ and bipartite subgraph density $2^{-\Omega(\sqrt d)}$. Our framework and results may help shed light on the difficulty of improving Lovett's $\tilde{O}(\sqrt{\operatorname{rank}(f)})$ bound (J. ACM '16) for the log-rank conjecture; in particular, any improvement on this bound would imply the first bipartite subgraph size bounds for parallel $3$-partitioned configurations which beat our generic bounds for unstructured configurations.
In this paper, we propose a weak approximation of the reflection coupling (RC) for stochastic differential equations (SDEs), and prove it converges weakly to the desired coupling. In contrast to the RC, the proposed approximate reflection coupling (ARC) need not take the hitting time of processes to the diagonal set into consideration and can be defined as the solution of some SDEs on the whole time interval. Therefore, ARC can work effectively against SDEs with different drift terms. As an application of ARC, an evaluation on the effectiveness of the stochastic gradient descent in a non-convex setting is also described. For the sample size $n$, the step size $\eta$, and the batch size $B$, we derive uniform evaluations on the time with orders $n^{-1}$, $\eta^{1/2}$, and $\sqrt{(n - B) / B (n - 1)}$, respectively.
In this paper we propose a solution strategy for the Cahn-Larch\'e equations, which is a model for linearized elasticity in a medium with two elastic phases that evolve subject to a Ginzburg-Landau type energy functional. The system can be seen as a combination of the Cahn-Hilliard regularized interface equation and linearized elasticity, and is non-linearly coupled, has a fourth order term that comes from the Cahn-Hilliard subsystem, and is non-convex and nonlinear in both the phase-field and displacement variables. We propose a novel semi-implicit discretization in time that uses a standard convex-concave splitting method of the nonlinear double-well potential, as well as special treatment to the elastic energy. We show that the resulting discrete system is equivalent to a convex minimization problem, and propose and prove the convergence of alternating minimization applied to it. Finally, we present numerical experiments that show the robustness and effectiveness of both alternating minimization and the monolithic Newton method applied to the newly proposed discrete system of equations. We compare it to a system of equations that has been discretized with a standard convex-concave splitting of the double-well potential, and implicit evaluations of the elasticity contributions and show that the newly proposed discrete system is better conditioned for linearization techniques.
Markov decision processes (MDP) and continuous-time MDP (CTMDP) are the fundamental models for non-deterministic systems with probabilistic uncertainty. Mean payoff (a.k.a. long-run average reward) is one of the most classic objectives considered in their context. We provide the first algorithm to compute mean payoff probably approximately correctly in unknown MDP; further, we extend it to unknown CTMDP. We do not require any knowledge of the state space, only a lower bound on the minimum transition probability, which has been advocated in literature. In addition to providing probably approximately correct (PAC) bounds for our algorithm, we also demonstrate its practical nature by running experiments on standard benchmarks.
We show that Contrastive Learning (CL) under a broad family of loss functions (including InfoNCE) has a unified formulation of coordinate-wise optimization on the network parameter $\boldsymbol{\theta}$ and pairwise importance $\alpha$, where the \emph{max player} $\boldsymbol{\theta}$ learns representation for contrastiveness, and the \emph{min player} $\alpha$ puts more weights on pairs of distinct samples that share similar representations. The resulting formulation, called $\alpha$-CL, unifies not only various existing contrastive losses, which differ by how sample-pair importance $\alpha$ is constructed, but also is able to extrapolate to give novel contrastive losses beyond popular ones, opening a new avenue of contrastive loss design. These novel losses yield comparable (or better) performance on CIFAR10 and STL-10 than classic InfoNCE. Furthermore, we also analyze the max player in detail: we prove that with fixed $\alpha$, max player is equivalent to Principal Component Analysis (PCA) for deep linear network, and almost all local minima are global and rank-1, recovering optimal PCA solutions. Finally, we extend our analysis on max player to 2-layer ReLU networks, showing that its fixed points can have higher ranks.
Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently. In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, robust principal component analysis, phase synchronization, and joint alignment. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated consideration of optimization and statistics leads to fruitful research findings.