We study the problem of estimating an unknown function from noisy data using shallow (single-hidden-layer) ReLU neural networks. The estimators we study minimize the sum of squared data-fitting errors plus a regularization term proportional to the squared Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (mean-squared error) of these neural network estimators when the data-generating function belongs to the space of functions of second-order bounded variation in the Radon domain. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem over this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. We also show that this is a "mixed variation" function space that contains classical multivariate function spaces, including certain Sobolev spaces and certain spectral Barron spaces. Finally, we use these results to quantify a gap between neural networks and linear methods (which include kernel methods). This paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality.
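For concreteness, the following minimal PyTorch sketch fits the kind of estimator described above; the width, learning rate, step count, and regularization weight are illustrative assumptions, not choices made in the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of the estimator described above: a single-hidden-layer
# ReLU network fit by squared error plus weight decay. All hyperparameters
# below are illustrative placeholders.

def fit_shallow_relu(x, y, width=128, weight_decay=1e-3, steps=2000):
    net = nn.Sequential(nn.Linear(x.shape[1], width), nn.ReLU(),
                        nn.Linear(width, 1))
    # The weight_decay option penalizes the squared Euclidean norm of all
    # parameters, i.e., the regularized objective discussed above.
    opt = torch.optim.SGD(net.parameters(), lr=1e-2,
                          weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()  # sum of squared data-fitting errors
        loss.backward()
        opt.step()
    return net
```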
We study the deep ReLU neural network collocation approximation of the solution $u$ to elliptic PDEs with lognormal inputs, parametrized by $\boldsymbol{y}$ from the non-compact set $\mathbb{R}^\infty$. The approximation error is measured in the norm of the Bochner space $L_2(\mathbb{R}^\infty, V, \gamma)$, where $\gamma$ is the infinite tensor-product standard Gaussian probability measure on $\mathbb{R}^\infty$ and $V$ is the energy space. Under a certain $\ell_q$-summability assumption on the lognormal inputs ($0<q<2$), we prove that for any sufficiently small $\delta > 0$ and every integer $n > 1$, one can construct a compactly supported deep ReLU neural network $\boldsymbol{\phi}_n := \big(\phi_j\big)_{j=1}^m$ of size at most $n$ on $\mathbb{R}^m$ with $m = \mathcal{O}(n^{1 - \delta})$, and a sequence of points $\big(\boldsymbol{y}^j\big)_{j=1}^m \subset \mathbb{R}^m$ (independent of $u$), so that the collocation approximation of $u$ by $\Phi_n u := \sum_{j=1}^m u\big(\boldsymbol{y}^j\big) \Phi_j$, which is based on the $m$ solvers $\big(u\big(\boldsymbol{y}^j\big)\big)_{j=1}^m$ and the deep ReLU network $\boldsymbol{\phi}_n$, gives the twofold error bound $\|u - \Phi_n u\|_{L_2(\mathbb{R}^\infty, V, \gamma)} = \mathcal{O}\big(m^{-(1/q - 1/2)}\big) = \mathcal{O}\big(n^{-(1-\delta)(1/q - 1/2)}\big)$, where $\Phi_j$ are the extensions of $\phi_j$ to the whole of $\mathbb{R}^\infty$. We also obtain similar results for the case when the lognormal inputs are parametrized on $\mathbb{R}^M$ with very large dimension $M$ and the approximation error is measured in the $\sqrt{g_M}$-weighted uniform norm of the Bochner space $L_\infty^{\sqrt{g}}(\mathbb{R}^M, V)$, where $g_M$ is the density function of the standard Gaussian probability measure on $\mathbb{R}^M$.
This paper studies the Group Influence with Minimum Cost problem, which aims to find a seed set of smallest cost that can influence all target groups, where each user is associated with a cost and a group is influenced if the total score of the influenced users belonging to the group is at least a certain threshold. As the group-influence function is neither submodular nor supermodular, theoretical bounds on the quality of solutions returned by the well-known greedy approach may not be guaranteed. To address this challenge, we propose a bi-criteria polynomial-time approximation algorithm whose guarantees hold with high probability. At the heart of the algorithm is a novel concept of group reachable reverse samples, which helps speed up the estimation of the group influence function. Finally, extensive experiments conducted on real social networks show that our proposed algorithm outperforms state-of-the-art algorithms in terms of both objective value and running time.
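For intuition, here is a minimal deterministic greedy baseline for this problem setting (our simplified sketch, not the paper's sampling-based algorithm; the toy influence model, in which a seed set influences exactly its own members, is an assumption made for illustration):

```python
# Hypothetical greedy baseline for Group Influence with Minimum Cost.
# Toy deterministic influence model: a seed set "influences" exactly the
# users it contains; a group is influenced once the total score of its
# influenced members reaches the group's threshold.

def greedy_group_cover(costs, scores, groups, thresholds):
    """costs[u], scores[u]: per-user cost and score (dicts keyed by user).
    groups: list of sets of users; thresholds: required score per group."""
    seeds, covered = set(), [0.0] * len(groups)

    def uncovered():
        return [g for g, t in enumerate(thresholds) if covered[g] < t]

    while uncovered():
        best, best_ratio = None, float("inf")
        for u in set(costs) - seeds:
            # Marginal progress toward the remaining group thresholds.
            gain = sum(min(scores[u], thresholds[g] - covered[g])
                       for g in uncovered() if u in groups[g])
            if gain > 0 and costs[u] / gain < best_ratio:
                best, best_ratio = u, costs[u] / gain
        if best is None:           # no user can make further progress
            break
        seeds.add(best)
        for g in uncovered():
            if best in groups[g]:
                covered[g] += scores[best]
    return seeds
```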
We consider a potential outcomes model in which interference may be present between any two units but the extent of interference diminishes with spatial distance. The causal estimand is the global average treatment effect, which compares counterfactual outcomes when all units are treated to outcomes when none are. We study a class of designs in which space is partitioned into clusters that are randomized into treatment and control. For each design, we estimate the treatment effect using a Horvitz-Thompson estimator that compares the average outcomes of units with all neighbors treated to units with no neighbors treated, where the neighborhood radius is of the same order as the cluster size dictated by the design. We derive the estimator's rate of convergence as a function of the design and degree of interference and use this to obtain estimator-design pairs in this class that achieve near-optimal rates of convergence under relatively minimal assumptions on interference. We prove that the estimators are asymptotically normal and provide a variance estimator. Finally, we discuss practical implementation of the designs by partitioning space using clustering algorithms.
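The following is a schematic sketch of the design and estimator described above, under simplifying assumptions of our own (independent Bernoulli cluster assignment and a hard exposure definition); the function and variable names are illustrative.

```python
import numpy as np

# Schematic Horvitz-Thompson-style estimator of the global average
# treatment effect under cluster randomization. Assumes each cluster is
# treated independently with probability p; the exposure definition
# ("all neighbors treated" vs. "no neighbors treated") follows the text.

def estimate_gate(coords, outcomes, cluster_ids, treated_clusters,
                  radius, p=0.5):
    """coords: (n, 2) unit locations; outcomes: observed responses;
    cluster_ids: (n,) cluster label of each unit; treated_clusters: set;
    radius: neighborhood radius (same order as the cluster size)."""
    n = len(outcomes)
    treated = np.isin(cluster_ids, list(treated_clusters))
    est = 0.0
    for i in range(n):
        nbrs = np.linalg.norm(coords - coords[i], axis=1) <= radius
        ks = np.unique(cluster_ids[nbrs])   # clusters touching neighborhood
        if treated[nbrs].all():
            est += outcomes[i] / (n * p ** len(ks))        # P(all treated)
        elif not treated[nbrs].any():
            est -= outcomes[i] / (n * (1 - p) ** len(ks))  # P(none treated)
    return est
```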
In this work we undertake a thorough study of the non-asymptotic properties of vanilla generative adversarial networks (GANs). We derive theoretical guarantees for density estimation with GANs under a proper choice of the deep neural network classes representing the generators and discriminators. In particular, we prove that the resulting estimate converges to the true density $\mathsf{p}^*$ in terms of the Jensen-Shannon (JS) divergence at the rate $(\log{n}/n)^{2\beta/(2\beta+d)}$, where $n$ is the sample size and $\beta$ determines the smoothness of $\mathsf{p}^*$. To the best of our knowledge, this is the first result in the literature on density estimation using vanilla GANs with JS convergence rates faster than $n^{-1/2}$ in the regime $\beta > d/2$. Moreover, we show that the obtained rate is minimax optimal (up to logarithmic factors) for the considered class of densities.
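As background on why JS divergence is the natural error metric here: at the optimal discriminator, the vanilla GAN objective equals, up to constants, the JS divergence between the data density and the generator's density. A minimal PyTorch training step is sketched below; the architectures, sizes, and learning rates are placeholder choices, not those analyzed in the paper.

```python
import torch
import torch.nn as nn

# Minimal vanilla-GAN training step (illustrative sketch).
d, latent = 2, 8
G = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, d))
D = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

def gan_step(x_real):
    z = torch.randn(len(x_real), latent)
    x_fake = G(z)
    # Discriminator update: push D(real) -> 1 and D(fake) -> 0.
    loss_d = bce(D(x_real), torch.ones(len(x_real), 1)) \
           + bce(D(x_fake.detach()), torch.zeros(len(x_real), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator update (common non-saturating variant of the vanilla
    # objective): push D(fake) -> 1.
    loss_g = bce(D(x_fake), torch.ones(len(x_real), 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```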
Frequency estimation is one of the most fundamental problems in streaming algorithms. Given a stream $S$ of elements from some universe $U=\{1 \ldots n\}$, the goal is to compute, in a single pass, a short sketch of $S$ so that for any element $i \in U$, one can estimate the number $x_i$ of times $i$ occurs in $S$ based on the sketch alone. Two state-of-the-art solutions to this problem are the Count-Min and Count-Sketch algorithms. The frequency estimator $\tilde{x}$ produced by Count-Min, using $O(1/\varepsilon \cdot \log n)$ dimensions, guarantees that $\|\tilde{x}-x\|_{\infty} \le \varepsilon \|x\|_1$ with high probability, and $\tilde{x} \ge x$ holds deterministically; Count-Min, however, works only under the assumption that $x \ge 0$. On the other hand, Count-Sketch, using $O(1/\varepsilon^2 \cdot \log n)$ dimensions, guarantees that $\|\tilde{x}-x\|_{\infty} \le \varepsilon \|x\|_2$ with high probability. A natural question is whether it is possible to design a best-of-both-worlds sketching method, with error guarantees depending on the $\ell_2$ norm and space comparable to Count-Sketch, that (like Count-Min) also has the no-underestimation property. Our main set of results shows that the answer to this question is negative. We show this in two incomparable computational models: linear sketching and streaming algorithms. We also study the complementary problem, where the sketch is required to not overestimate, i.e., $\tilde{x} \le x$ should always hold.
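For reference, here is a minimal Count-Min sketch; the hashing scheme below is a simplified stand-in for the pairwise-independent hash families used in the formal analysis.

```python
import random

# Minimal Count-Min sketch. With width w = O(1/eps) and depth
# d = O(log n) rows, the estimate exceeds the true count x_i by at most
# eps * ||x||_1 with high probability, and never underestimates it
# (for non-negative streams), matching the guarantees stated above.

class CountMin:
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cell(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._cell(r, item)] += count

    def estimate(self, item):
        # Each row can only overestimate, so take the minimum over rows.
        return min(self.table[r][self._cell(r, item)]
                   for r in range(self.depth))
```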
Measurement error is a pervasive issue that renders the results of an analysis unreliable. The measurement error literature contains numerous correction techniques, which can be broadly divided into those that aim to produce exactly consistent estimators and those that are only approximately consistent. While consistency is a desirable property, it is typically attained only under specific model assumptions. Two techniques, regression calibration and simulation extrapolation, are used frequently in a wide variety of parametric and semiparametric settings; in many of these settings, however, the methods are only approximately consistent. We generalize these corrections, relaxing the assumptions placed on replicate measurements. Under regularity conditions, the estimators are shown to be asymptotically normal, with a sandwich estimator for the asymptotic variance. Through simulation, we demonstrate the improved performance of the modified estimators over the standard techniques when these assumptions are violated. We motivate these corrections using the Framingham Heart Study and apply the generalized techniques to an analysis of these data.
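To fix ideas about simulation extrapolation, here is a textbook-style SIMEX sketch for a simple linear slope attenuated by additive measurement error; the quadratic extrapolant and all settings are conventional choices of ours, not the paper's generalized procedure.

```python
import numpy as np

# Illustrative SIMEX (simulation extrapolation): re-fit a naive estimator
# on data with artificially inflated error variance (1 + lam) * sigma_u^2
# at several levels lam, then extrapolate the fitted curve back to
# lam = -1, the hypothetical error-free setting.

def simex_slope(w, y, sigma_u, lambdas=(0.5, 1.0, 1.5, 2.0), B=200, seed=0):
    """w: error-prone covariate, y: response, sigma_u: known error s.d."""
    rng = np.random.default_rng(seed)
    naive = lambda x: np.cov(x, y, bias=True)[0, 1] / np.var(x)  # OLS slope
    lams, betas = [0.0], [naive(w)]
    for lam in lambdas:
        # Average the naive fit over B remeasured data sets.
        sims = [naive(w + np.sqrt(lam) * sigma_u *
                      rng.standard_normal(len(w))) for _ in range(B)]
        lams.append(lam)
        betas.append(np.mean(sims))
    # Quadratic extrapolation back to lam = -1 (no measurement error).
    coef = np.polyfit(lams, betas, deg=2)
    return np.polyval(coef, -1.0)
```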
We are interested in the optimization of convex domains under a PDE constraint. Due to the difficulties of approximating convex domains in $\mathbb{R}^3$, the restriction to rotationally symmetric domains is used to reduce shape optimization problems to a two-dimensional setting. For the optimization of an eigenvalue arising in a problem of optimal insulation, the existence of an optimal domain is proven. An algorithm is proposed that can be applied to general shape optimization problems under the geometric constraints of convexity and rotational symmetry. The approximated optimal domains for the eigenvalue problem in optimal insulation are discussed.
In this paper, we develop deterministic fully dynamic algorithms for computing approximate distances in a graph with worst-case update time guarantees. In particular, we obtain improved dynamic algorithms that, given an unweighted and undirected graph $G=(V,E)$ undergoing edge insertions and deletions, and a parameter $0 < \epsilon \leq 1$, maintain $(1+\epsilon)$-approximations of the $st$ distance between a single pair of nodes, the distances from a single source to all nodes ("SSSP"), the distances from multiple sources to all nodes ("MSSP"), or the distances between all nodes ("APSP"). Our main result is a deterministic algorithm for maintaining $(1+\epsilon)$-approximate single-source distances with worst-case update time $O(n^{1.529})$ (for the current best known bound on the matrix multiplication exponent $\omega$). This matches a conditional lower bound by [BNS, FOCS 2019]. We further show that we can go beyond this SSSP bound for the problem of maintaining approximate $st$ distances by providing a deterministic algorithm with worst-case update time $O(n^{1.447})$, which even improves upon the fastest known randomized algorithm for this problem. At its core, our approach combines algebraic distance maintenance data structures with near-additive emulator constructions. This also leads to novel dynamic algorithms for maintaining $(1+\epsilon, \beta)$-emulators that improve upon the state of the art, which might be of independent interest. Our techniques also lead to improvements for randomized approximate diameter maintenance.
We examine one-hidden-layer neural networks with random weights. It is well known that, in the limit of infinitely many neurons, they simplify to Gaussian processes. For networks with a polynomial activation, we demonstrate that the rate of this convergence in the 2-Wasserstein metric is $O(n^{-\frac{1}{2}})$, where $n$ is the number of hidden neurons; we suspect this rate is asymptotically sharp. We improve the known convergence rates for other activations: to a power law in $n$ for ReLU, and to an inverse square root of $n$ up to logarithmic factors for erf. We explore the interplay between spherical harmonics, Stein kernels, and optimal transport in the non-isotropic setting.
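To make the limiting object concrete, here is a small numerical check (our illustration, not the paper's proof technique): the empirical output covariance of a random one-hidden-layer network at a pair of inputs approaches the limiting Gaussian-process kernel as the width $n$ grows. The weight scalings and the tanh activation are assumptions for this demo.

```python
import numpy as np

# Empirical check of the NN-to-GP limit: the output covariance of a
# random one-hidden-layer network at two inputs approaches the limiting
# GP kernel E[act(w.x) act(w.x')] as the width n grows.

def field(x, n, act=np.tanh, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    W = rng.standard_normal((trials, n, d)) / np.sqrt(d)  # hidden weights
    a = rng.standard_normal((trials, n)) / np.sqrt(n)     # output weights
    # f(x) = sum_k a_k * act(w_k . x); one network draw per trial.
    return np.einsum('tk,tkp->tp', a, act(np.einsum('tkd,pd->tkp', W, x)))

x = np.array([[1.0, 0.0], [0.6, 0.8]])   # two unit-norm inputs
for n in (4, 64, 1024):
    print(n, np.cov(field(x, n).T))      # 2x2 covariance stabilizes with n
```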
Sampling methods (e.g., node-wise, layer-wise, or subgraph sampling) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamics of optimization, which leads to high variance in estimating the stochastic gradients. The high-variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method decomposes into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, and that both types of variance must be mitigated to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance and explicitly reduces the variance introduced by embedding approximation. We show, theoretically and empirically, that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and better generalization than existing methods.
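To illustrate the forward-stage variance named above, here is a toy NumPy example of our own (not the paper's estimator): for a one-layer aggregation $h_i = \tanh\big(\mathrm{mean}_{j \in N(i)} x_j\big)$, sampling only $k$ neighbors makes the embedding itself random, and this embedding approximation variance enters every gradient computed from $h_i$, on top of the usual mini-batch (stochastic gradient) variance.

```python
import numpy as np

# Toy illustration of embedding approximation variance in neighbor
# sampling. All sizes and the tanh aggregation are illustrative.

rng = np.random.default_rng(0)
n, d, k = 200, 8, 5                       # nodes, features, sampled neighbors
X = rng.standard_normal((n, d))
nbrs = [rng.choice(n, size=20, replace=False) for _ in range(n)]

def embed(i, sample=None):
    """Exact aggregation over all neighbors, or over `sample` of them."""
    idx = nbrs[i] if sample is None else rng.choice(nbrs[i], size=sample)
    return np.tanh(X[idx].mean(axis=0))

full = embed(0)                            # full-neighborhood embedding
draws = np.stack([embed(0, sample=k) for _ in range(500)])
print("bias:", np.abs(draws.mean(axis=0) - full).max())
print("embedding approximation variance:", draws.var(axis=0).mean())
```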