This paper establishes the (nearly) optimal approximation error characterization of deep rectified linear unit (ReLU) networks for smooth functions in terms of both width and depth simultaneously. To that end, we first prove that multivariate polynomials can be approximated by deep ReLU networks of width $\mathcal{O}(N)$ and depth $\mathcal{O}(L)$ with an approximation error $\mathcal{O}(N^{-L})$. Through local Taylor expansions and their deep ReLU network approximations, we show that deep ReLU networks of width $\mathcal{O}(N\ln N)$ and depth $\mathcal{O}(L\ln L)$ can approximate $f\in C^s([0,1]^d)$ with a nearly optimal approximation error $\mathcal{O}(\|f\|_{C^s([0,1]^d)}N^{-2s/d}L^{-2s/d})$. Our estimate is non-asymptotic in the sense that it is valid for arbitrary width and depth specified by $N\in\mathbb{N}^+$ and $L\in\mathbb{N}^+$, respectively.
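For illustration of the polynomial step, the classical "sawtooth" construction (Yarotsky, 2017) realizes the piecewise-linear interpolant of $x^2$ on $[0,1]$ with a ReLU network of depth proportional to $L$ and uniform error $4^{-(L+1)}$. The numpy sketch below shows only this mechanism, not the paper's exact width-depth trade-off.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tooth(x):
    # triangle wave generator g: [0,1] -> [0,1], realized by three ReLUs
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def relu_square(x, L):
    # Yarotsky's construction: x - sum_{s=1}^{L} g_s(x)/4**s equals the
    # piecewise-linear interpolant of x**2 at 2**L + 1 uniform nodes,
    # so the uniform error on [0,1] is exactly 4**-(L+1).
    g, out = x.copy(), x.copy()
    for s in range(1, L + 1):
        g = tooth(g)               # s-fold composition: 2**(s-1) teeth
        out = out - g / 4**s
    return out

x = np.linspace(0.0, 1.0, 10001)
for L in (2, 4, 6):
    print(L, np.max(np.abs(relu_square(x, L) - x * x)))  # ~ 4**-(L+1)
```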
Deep Neural Networks (DNNs) have performed admirably in classification tasks. However, the characterization of their classification uncertainties, required for certain applications, has been lacking. In this work, we investigate the issue by assessing DNNs' ability to estimate conditional probabilities and propose a framework for systematic uncertainty characterization. Denoting the input sample by $x$ and the category by $y$, the classification task of assigning a category $y$ to a given input $x$ can be reduced to estimating the conditional probabilities $p(y|x)$, which the DNN approximates at its last layer using the softmax function. Since softmax yields a vector whose elements all lie in the interval $(0,1)$ and sum to $1$, it suggests a probabilistic interpretation of the DNN's output. Using synthetic and real-world datasets, we examine the impact of various factors, e.g., the probability density $f(x)$ and the inter-categorical sparsity, on the accuracy of DNNs' estimates of $p(y|x)$, and find that the probability density and the inter-categorical sparsity have a greater impact on DNNs' classification uncertainty than the prior probability does.
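For reference, the softmax map underlying this probabilistic reading is shown in the minimal numpy sketch below; the logits are hypothetical values standing in for a DNN's last-layer outputs on one input $x$.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax: subtract the max before exponentiating
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([2.0, 0.5, -1.0])  # hypothetical last-layer outputs
p = softmax(logits)                  # read as estimates of p(y | x)
print(p, p.sum())                    # each entry in (0,1); entries sum to 1
```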
We obtain convergence rates for the collocation approximation by deep ReLU neural networks of the solution $u$ to elliptic PDEs with lognormal inputs, parametrized by $\boldsymbol{y}$ from the non-compact set $\mathbb{R}^\infty$. The approximation error is measured in the norm of the Bochner space $L_2(\mathbb{R}^\infty, V, \gamma)$, where $\gamma$ is the infinite tensor-product standard Gaussian probability measure on $\mathbb{R}^\infty$ and $V$ is the energy space. Under a certain $\ell_q$-summability assumption ($0<q<2$) on the lognormal inputs, we prove that, for any sufficiently small $\delta>0$ and every integer $n > 1$, one can construct a compactly supported deep ReLU neural network $\boldsymbol{\phi}_n:= \big(\phi_j\big)_{j=1}^m$ of size at most $n$ on $\mathbb{R}^m$ with $m =\mathcal{O}(n^{1 - \delta})$, and a sequence of points $\big(\boldsymbol{y}^j\big)_{j=1}^m \subset \mathbb{R}^m$ (independent of $u$), such that the collocation approximation of $u$ by $\Phi_n u:= \sum_{j=1}^m u\big(\boldsymbol{y}^j\big) \Phi_j$, based on the $m$ solvers $\big( u\big(\boldsymbol{y}^j\big)\big)_{j=1}^m$ and the deep ReLU network $\boldsymbol{\phi}_n$, satisfies the twofold error bound $\|u- \Phi_n u \|_{L_2(\mathbb{R}^\infty, V, \gamma)} = \mathcal{O}\big(m^{- (1/q - 1/2)}\big) =\mathcal{O}\big(n^{- (1-\delta)(1/q - 1/2)}\big),$ where the $\Phi_j$ are the extensions of the $\phi_j$ to the whole of $\mathbb{R}^\infty$. We also obtain similar results for the case when the lognormal inputs are parametrized on $\mathbb{R}^M$ with very large dimension $M$ and the approximation error is measured in the $\sqrt{g_M}$-weighted uniform norm of the Bochner space $L_\infty^{\sqrt{g_M}}(\mathbb{R}^M, V)$, where $g_M$ is the density of the standard Gaussian probability measure on $\mathbb{R}^M$.
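A drastically simplified one-dimensional analogue may help fix ideas: with hat functions $\phi_j$ on a uniform grid, each hat itself a three-neuron ReLU network, the collocation operator $\Phi_n u = \sum_j u(y^j)\,\phi_j$ is ordinary piecewise-linear interpolation from point solves. The sketch below uses a hypothetical scalar solution map and elides everything Gaussian and infinite-dimensional.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def hat(y, yj, h):
    # piecewise-linear hat centered at yj, realized with three ReLUs
    return (relu((y - (yj - h)) / h)
            - 2 * relu((y - yj) / h)
            + relu((y - (yj + h)) / h))

u = np.tanh                        # hypothetical solution map y -> u(y)
ygrid = np.linspace(-3, 3, 13)     # collocation points y_j
h = ygrid[1] - ygrid[0]

y = np.linspace(-3, 3, 1001)
approx = sum(u(yj) * hat(y, yj, h) for yj in ygrid)  # Phi_n u
print(np.max(np.abs(approx - u(y))))                 # O(h^2) for smooth u
```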
This paper studies the approximation error of ReLU networks in terms of the number of intrinsic parameters (i.e., those depending on the target function $f$). First, we prove by construction that, for any Lipschitz continuous function $f$ on $[0,1]^d$ with Lipschitz constant $\lambda>0$, a ReLU network with $n+2$ intrinsic parameters can approximate $f$ with an exponentially small error $5\lambda \sqrt{d}\,2^{-n}$ measured in the $L^p$-norm for $p\in [1,\infty)$. More generally, for an arbitrary continuous function $f$ on $[0,1]^d$ with modulus of continuity $\omega_f(\cdot)$, the approximation error is $\omega_f(\sqrt{d}\, 2^{-n})+2^{-n+2}\omega_f(\sqrt{d})$. Next, we extend these two results from the $L^p$-norm to the $L^\infty$-norm at the price of $3^d n+2$ intrinsic parameters. Finally, using a high-precision binary representation and the bit-extraction technique via a fixed ReLU network independent of the target function, we design, theoretically, a ReLU network with only three intrinsic parameters that approximates H\"older continuous functions with arbitrarily small error.
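A toy one-dimensional illustration of the binary-representation idea: quantize the target on a dyadic grid and pack all quantized values into the bits of a single number $\theta$, which a fixed decoder recovers by bit extraction. The sketch below mimics only this counting argument, in plain integer arithmetic rather than as a ReLU network; the target function, grid size, and bit depth are illustrative choices.

```python
import numpy as np

f = lambda x: np.sin(3 * x)      # hypothetical Lipschitz target on [0,1]
K, B = 16, 12                    # K grid cells, B bits per stored value

grid = (np.arange(K) + 0.5) / K
q = np.round((f(grid) + 1) / 2 * (2**B - 1)).astype(np.int64)
theta = 0
for v in q:                      # pack all K values into one parameter
    theta = (theta << B) | int(v)

def decode(x, theta):
    k = min(int(x * K), K - 1)               # locate the cell holding x
    v = (theta >> (B * (K - 1 - k))) & ((1 << B) - 1)  # bit extraction
    return v / (2**B - 1) * 2 - 1            # dequantize

xs = np.linspace(0, 1, 1000)
err = max(abs(decode(x, theta) - f(x)) for x in xs)
print(err)  # O(1/K) grid error plus O(2**-B) quantization error
```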
We revisit Min-Mean-Cycle, the classical problem of finding a cycle in a weighted directed graph with minimum mean weight. Despite an extensive algorithmic literature, previous work falls short of a near-linear runtime in the number of edges $m$. We propose an approximation algorithm that, for graphs with polylogarithmic diameter, achieves a near-linear runtime. In particular, this is the first algorithm whose runtime scales in the number of vertices $n$ as $\tilde{O}(n^2)$ for the complete graph. Moreover, regardless of the diameter, the algorithm uses only $O(n)$ memory beyond reading the input, making it "memory-optimal". Our approach is based on solving a linear programming relaxation using entropic regularization, which reduces the problem to Matrix Balancing -- \`a la the popular reduction of Optimal Transport to Matrix Scaling. The algorithm is practical and simple to implement.
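To make the Matrix Balancing target of the reduction concrete (the entropic-regularization step is elided here), Osborne's classical iteration finds a diagonal scaling $D$ so that the row and column sums of $D^{-1}AD$ agree coordinate-wise. A minimal numpy sketch of that classical iteration, not the paper's algorithm:

```python
import numpy as np

def osborne_balance(A, sweeps=100):
    # Osborne's iteration for l1 matrix balancing: find d > 0 such that
    # B = diag(1/d) @ A @ diag(d) has matching i-th row and column sums.
    # A is assumed entrywise nonnegative and irreducible.
    n = A.shape[0]
    d = np.ones(n)
    for _ in range(sweeps):
        for i in range(n):
            r = (A[i, :] * d).sum() - A[i, i] * d[i]  # off-diag row mass
            c = (A[:, i] / d).sum() - A[i, i] / d[i]  # off-diag col mass
            d[i] = np.sqrt(r / c)  # equalizes row/column sum at index i
    return d

rng = np.random.default_rng(0)
A = rng.random((6, 6))
d = osborne_balance(A)
B = A * (d[None, :] / d[:, None])
print(np.abs(B.sum(axis=1) - B.sum(axis=0)).max())  # ~0 once balanced
```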
For artificial deep neural networks, we prove expression rates for analytic functions $f:\mathbb{R}^d\to\mathbb{R}$ in the norm of $L^2(\mathbb{R}^d,\gamma_d)$, where $d\in {\mathbb{N}}\cup\{ \infty \}$ and $\gamma_d$ denotes the Gaussian product probability measure on $\mathbb{R}^d$. We consider in particular ReLU and ReLU${}^k$ activations for integer $k\geq 2$. For $d\in\mathbb{N}$, we show exponential convergence rates in $L^2(\mathbb{R}^d,\gamma_d)$. In the case $d=\infty$, under suitable smoothness and sparsity assumptions on $f:\mathbb{R}^{\mathbb{N}}\to\mathbb{R}$, with $\gamma_\infty$ denoting an infinite (Gaussian) product measure on $\mathbb{R}^{\mathbb{N}}$, we prove dimension-independent expression rate bounds in the norm of $L^2(\mathbb{R}^{\mathbb{N}},\gamma_\infty)$. The rates depend only on the quantified holomorphy of (an analytic continuation of) the map $f$ to a product of strips in $\mathbb{C}^d$. As an application, we prove expression rate bounds for deep ReLU-NNs approximating response surfaces of elliptic PDEs with log-Gaussian random field inputs.
For a graph class $\mathcal{C}$, the $\mathcal{C}$-Edge-Deletion problem asks, for a given graph $G$, to delete the minimum number of edges from $G$ so as to obtain a graph in $\mathcal{C}$. We study the $\mathcal{C}$-Edge-Deletion problem for the classes $\mathcal{C}$ of permutation graphs, interval graphs, and other related graph classes. It follows from Courcelle's Theorem that these problems are fixed-parameter tractable when parameterized by treewidth. In this paper, we present concrete FPT algorithms for these problems. By giving explicit algorithms and analyzing them in detail, we obtain algorithms that are significantly faster than those derived from Courcelle's Theorem.
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. Specifically, we are given a set $T$ of $n$ points in $\mathbb{R}^d$ and a parameter $0< \alpha <\frac 1 2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a well-behaved distribution $\mathcal{D}$ and the remaining $(1-\alpha)$-fraction are arbitrary. The goal is to output a small list of vectors, at least one of which is close to the mean of $\mathcal{D}$. We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees, with running time $O(n^{1 + \epsilon_0} d)$, for any fixed $\epsilon_0 > 0$. All prior algorithms for this problem had additional polynomial factors in $\frac 1 \alpha$. We leverage this result, together with additional techniques, to obtain the first almost-linear time algorithms for clustering mixtures of $k$ separated well-behaved distributions, nearly-matching the statistical guarantees of spectral methods. Prior clustering algorithms inherently relied on an application of $k$-PCA, thereby incurring runtimes of $\Omega(n d k)$. This marks the first runtime improvement for this basic statistical problem in nearly two decades. The starting point of our approach is a novel and simpler near-linear time robust mean estimation algorithm in the $\alpha \to 1$ regime, based on a one-shot matrix multiplicative weights-inspired potential decrease. We crucially leverage this new algorithmic framework in the context of the iterative multi-filtering technique of Diakonikolas et al. '18, '20, providing a method to simultaneously cluster and downsample points using one-dimensional projections -- thus, bypassing the $k$-PCA subroutines required by prior algorithms.
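As a much simplified, purely illustrative contrast to the algorithms above: in one dimension, list-decodable mean estimation already admits a naive windowing baseline that outputs $O(1/\alpha)$ candidates, one of which lands near the inlier mean. The numpy sketch below shows only this baseline; the window width and thresholds are ad hoc choices, far from the paper's method.

```python
import numpy as np

def candidate_means(x, alpha, width=4.0):
    # Scan sorted points; any block of ceil(alpha*n/2) consecutive points
    # spanning <= width yields a candidate mean, after which we skip past
    # the block, so at most ~2/alpha candidates are produced.
    x = np.sort(x)
    n, need = len(x), int(np.ceil(alpha * len(x) / 2))
    cands, i = [], 0
    while i + need <= n:
        if x[i + need - 1] - x[i] <= width:
            cands.append(x[i:i + need].mean())
            i += need
        else:
            i += 1
    return cands

rng = np.random.default_rng(0)
alpha, n = 0.1, 5000
inliers = rng.normal(7.0, 1.0, int(alpha * n))           # alpha-fraction
outliers = rng.uniform(-100.0, 100.0, n - len(inliers))  # arbitrary rest
print(candidate_means(np.concatenate([inliers, outliers]), alpha))
```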
In this paper, we study, from a theoretical perspective, how powerful graph neural networks (GNNs) can be at learning approximation algorithms for combinatorial problems. To this end, we first establish a new class of GNNs that can solve a strictly wider variety of problems than existing GNNs. Then, we bridge the gap between GNN theory and the theory of distributed local algorithms to show that the most powerful GNN can learn approximation algorithms for the minimum dominating set problem and the minimum vertex cover problem with certain approximation ratios, and that no GNN can achieve better ratios. This paper is the first to elucidate the approximation ratios of GNNs for combinatorial problems. Furthermore, we prove that adding a coloring or weak coloring to each node feature improves these approximation ratios, indicating that preprocessing and feature engineering theoretically strengthen model capabilities.
In order to avoid the curse of dimensionality frequently encountered in big data analysis, there has been vast development in recent years of linear and nonlinear dimension reduction techniques. These techniques (sometimes referred to as manifold learning) assume that the scattered input data lie on a lower-dimensional manifold, so the high dimensionality can be overcome by learning the lower-dimensional behavior. However, in real-life applications data are often very noisy. In this work, we propose a method to approximate $\mathcal{M}$, a $d$-dimensional $C^{m+1}$-smooth submanifold of $\mathbb{R}^n$ ($d \ll n$), from noisy scattered data points (i.e., a data cloud). We assume that the data points are located "near" the lower-dimensional manifold and suggest a nonlinear moving least-squares projection onto an approximating $d$-dimensional manifold. Under some mild assumptions, the resulting approximant is shown to be infinitely smooth and of high approximation order, namely $O(h^{m+1})$, where $h$ is the fill distance and $m$ is the degree of the local polynomial approximation. The method assumes no analytic knowledge of the approximated manifold, and the approximation algorithm is linear in the large dimension $n$. Furthermore, the approximating manifold can serve as a framework for performing operations directly on the high-dimensional data in a computationally efficient manner. This way, the preparatory step of dimension reduction, which distorts the data, can be avoided altogether.
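To make the projection concrete in the simplest nontrivial setting, the sketch below projects a query point onto a noisy curve in $\mathbb{R}^2$ via a weighted local frame plus a weighted polynomial fit. It is a stripped-down analogue of the nonlinear moving least-squares procedure, with the Gaussian weight, bandwidth, and degree as illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 400)
P = np.c_[np.cos(t), np.sin(t)] + 0.02 * rng.normal(size=(400, 2))

def mls_project(q, P, h=0.4, m=2):
    w = np.exp(-np.sum((P - q) ** 2, axis=1) / h**2)  # Gaussian weights
    mu = (w[:, None] * P).sum(0) / w.sum()            # weighted centroid
    C = (w[:, None, None]
         * np.einsum('ni,nj->nij', P - mu, P - mu)).sum(0)
    _, V = np.linalg.eigh(C)
    u, v = V[:, -1], V[:, 0]              # local tangent / normal frame
    s = (P - mu) @ u                      # local coordinate
    y = (P - mu) @ v                      # heights over the tangent line
    A, W = np.vander(s, m + 1), np.sqrt(w)
    coef = np.linalg.lstsq(A * W[:, None], y * W, rcond=None)[0]
    s0 = (q - mu) @ u
    return mu + s0 * u + np.polyval(coef, s0) * v

q = np.array([1.1, 0.1])
p = mls_project(q, P)
print(p, np.linalg.norm(p))  # projected point lies near the unit circle
```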
Many resource allocation problems in the cloud can be described as a basic Virtual Network Embedding Problem (VNEP): finding mappings of request graphs (describing the workloads) onto a substrate graph (describing the physical infrastructure). In the offline setting, the two natural objectives are profit maximization, i.e., embedding a maximal number of request graphs subject to the resource constraints, and cost minimization, i.e., embedding all requests at minimal overall cost. The VNEP can be seen as a generalization of classic routing and call admission problems in which the requests are arbitrary graphs whose communication endpoints are not fixed. Owing to its applications, the problem has been studied intensively in the networking community; the underlying algorithmic problem, however, is poorly understood. This paper presents the first fixed-parameter tractable approximation algorithms for the VNEP. Our algorithms are based on randomized rounding. Because of the flexible mapping options and the arbitrary request-graph topologies, we show that a novel linear programming formulation is required: only this formulation, which accounts for the structure of the request graphs, enables the computation of convex combinations of valid mappings. Accordingly, to capture the structure of request graphs, we introduce the graph-theoretic notions of extraction orders and extraction width, and show that our algorithms have runtime exponential in the request graphs' maximal extraction width. Hence, for request graphs of fixed extraction width, we obtain the first polynomial-time approximations. Studying the new notion of extraction orders, we show that (i) computing extraction orders of minimal width is NP-hard and (ii) computing decomposable LP solutions is in general NP-hard, even when request graphs are restricted to planar ones.
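For intuition about the rounding step (not its resource-violation analysis): once the LP has been solved and each request's fractional solution decomposed into a convex combination of valid mappings, rounding samples one mapping per request from its decomposition. A hedged Python sketch with illustrative data structures:

```python
import random

def round_once(decomp):
    # decomp: request -> list of (probability, mapping) pairs from the
    # LP decomposition; probabilities sum to at most 1, and any leftover
    # mass means the request is rejected (mapping None).
    solution = {}
    for request, options in decomp.items():
        u, acc, chosen = random.random(), 0.0, None
        for prob, mapping in options:
            acc += prob
            if u < acc:
                chosen = mapping
                break
        solution[request] = chosen
    return solution

# The actual algorithms repeat this sampling until the rounded solution
# (approximately) respects the substrate's capacity constraints.
```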