When the regression function belongs to the standard smooth classes consisting of univariate functions with derivatives up to the $(\gamma+1)$th order bounded in absolute values by a common constant everywhere or a.e., it is well known that the minimax optimal rate of convergence in mean squared error (MSE) is $\left(\frac{\sigma^{2}}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$ when $\gamma$ is finite and the sample size $n\rightarrow\infty$. From a nonasymptotic viewpoint that does not take $n$ to infinity, this paper shows that: for the standard H\"older and Sobolev classes, the minimax optimal rate is $\frac{\sigma^{2}\left(\gamma+1\right)}{n}$ ($\succsim\left(\frac{\sigma^{2}}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$) when $\frac{n}{\sigma^{2}}\precsim\left(\gamma+1\right)^{2\gamma+3}$ and $\left(\frac{\sigma^{2}}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$ ($\succsim\frac{\sigma^{2}\left(\gamma+1\right)}{n}$) when $\frac{n}{\sigma^{2}}\succsim\left(\gamma+1\right)^{2\gamma+3}$. To establish these results, we derive upper and lower bounds on the covering and packing numbers for the generalized H\"older class where the absolute value of the $k$th ($k=0,...,\gamma$) derivative is bounded by a parameter $R_{k}$ and the $\gamma$th derivative is $R_{\gamma+1}-$Lipschitz (and also for the generalized ellipsoid class of smooth functions). Our bounds sharpen the classical metric entropy results for the standard classes, and give the general dependence on $\gamma$ and $R_{k}$. By deriving the minimax optimal MSE rates under various (well motivated) $R_{k}$s for the smooth classes with the help of our new entropy bounds, we show several interesting results that cannot be shown with the existing entropy bounds in the literature.
We study the theory of neural network (NN) from the lens of classical nonparametric regression problems with a focus on NN's ability to adaptively estimate functions with heterogeneous smoothness --- a property of functions in Besov or Bounded Variation (BV) classes. Existing work on this problem requires tuning the NN architecture based on the function spaces and sample sizes. We consider a "Parallel NN" variant of deep ReLU networks and show that the standard weight decay is equivalent to promoting the $\ell_p$-sparsity ($0<p<1$) of the coefficient vector of an end-to-end learned function bases, i.e., a dictionary. Using this equivalence, we further establish that by tuning only the weight decay, such Parallel NN achieves an estimation error arbitrarily close to the minimax rates for both the Besov and BV classes. Notably, it gets exponentially closer to minimax optimal as the NN gets deeper. Our research sheds new lights on why depth matters and how NNs are more powerful than kernel methods.
We introduce a new distortion measure for point processes called functional-covering distortion. It is inspired by intensity theory and is related to both the covering of point processes and logarithmic loss distortion. We obtain the distortion-rate function with feedforward under this distortion measure for a large class of point processes. For Poisson processes, the rate-distortion function is obtained under a general condition called constrained functional-covering distortion, of which both covering and functional-covering are special cases. Also for Poisson processes, we characterize the rate-distortion region for a two-encoder CEO problem and show that feedforward does not enlarge this region.
We study the problem of testing whether a function $f: \mathbb{R}^n \to \mathbb{R}$ is a polynomial of degree at most $d$ in the \emph{distribution-free} testing model. Here, the distance between functions is measured with respect to an unknown distribution $\mathcal{D}$ over $\mathbb{R}^n$ from which we can draw samples. In contrast to previous work, we do not assume that $\mathcal{D}$ has finite support. We design a tester that given query access to $f$, and sample access to $\mathcal{D}$, makes $(d/\varepsilon)^{O(1)}$ many queries to $f$, accepts with probability $1$ if $f$ is a polynomial of degree $d$, and rejects with probability at least $2/3$ if every degree-$d$ polynomial $P$ disagrees with $f$ on a set of mass at least $\varepsilon$ with respect to $\mathcal{D}$. Our result also holds under mild assumptions when we receive only a polynomial number of bits of precision for each query to $f$, or when $f$ can only be queried on rational points representable using a logarithmic number of bits. Along the way, we prove a new stability theorem for multivariate polynomials that may be of independent interest.
Reinforcement Learning (RL) approaches are lately deployed for orchestrating wireless communications empowered by Reconfigurable Intelligent Surfaces (RISs), leveraging their online optimization capabilities. Most commonly, in RL-based formulations for realistic RISs with low resolution phase-tunable elements, each configuration is modeled as a distinct reflection action, resulting to inefficient exploration due to the exponential nature of the search space. In this paper, we consider RISs with 1-bit phase resolution elements, and model the action of each of them as a binary vector including the feasible reflection coefficients. We then introduce two variations of the well-established Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG) agents, aiming for effective exploration of the binary action spaces. For the case of DQN, we make use of an efficient approximation of the Q-function, whereas a discretization post-processing step is applied to the output of DDPG. Our simulation results showcase that the proposed techniques greatly outperform the baseline in terms of the rate maximization objective, when large-scale RISs are considered. In addition, when dealing with moderate scale RIS sizes, where the conventional DQN based on configuration-based action spaces is feasible, the performance of the latter technique is similar to the proposed learning approach.
Many existing algorithms for streaming geometric data analysis have been plagued by exponential dependencies in the space complexity, which are undesirable for processing high-dimensional data sets. In particular, once $d\geq\log n$, there are no known non-trivial streaming algorithms for problems such as maintaining convex hulls and L\"owner-John ellipsoids of $n$ points, despite a long line of work in streaming computational geometry since [AHV04]. We simultaneously improve these results to $\mathrm{poly}(d,\log n)$ bits of space by trading off with a $\mathrm{poly}(d,\log n)$ factor distortion. We achieve these results in a unified manner, by designing the first streaming algorithm for maintaining a coreset for $\ell_\infty$ subspace embeddings with $\mathrm{poly}(d,\log n)$ space and $\mathrm{poly}(d,\log n)$ distortion. Our algorithm also gives similar guarantees in the \emph{online coreset} model. Along the way, we sharpen results for online numerical linear algebra by replacing a log condition number dependence with a $\log n$ dependence, answering a question of [BDM+20]. Our techniques provide a novel connection between leverage scores, a fundamental object in numerical linear algebra, and computational geometry. For $\ell_p$ subspace embeddings, we give nearly optimal trade-offs between space and distortion for one-pass streaming algorithms. For instance, we give a deterministic coreset using $O(d^2\log n)$ space and $O((d\log n)^{1/2-1/p})$ distortion for $p>2$, whereas previous deterministic algorithms incurred a $\mathrm{poly}(n)$ factor in the space or the distortion [CDW18]. Our techniques have implications in the offline setting, where we give optimal trade-offs between the space complexity and distortion of subspace sketch data structures. To do this, we give an elementary proof of a "change of density" theorem of [LT80] and make it algorithmic.
We consider the problem of nonparametric estimation of the drift and diffusion coefficients of a Stochastic Differential Equation (SDE), based on $n$ independent replicates $\left\{X_i(t)\::\: t\in [0,1]\right\}_{1 \leq i \leq n}$, observed sparsely and irregularly on the unit interval, and subject to additive noise corruption. By \textit{sparse} we intend to mean that the number of measurements per path can be arbitrary (as small as two), and remain constant with respect to $n$. We focus on time-inhomogeneous SDE of the form $dX_t = \mu(t)X_t^{\alpha}dt + \sigma(t)X_t^{\beta}dW_t$, where $\alpha \in \{0,1\}$ and $\beta \in \{0,1/2,1\}$, which includes prominent examples such as Brownian motion, Ornstein-Uhlenbeck process, geometric Brownian motion, and Brownian bridge. Our estimators are constructed by relating the local (drift/diffusion) parameters of the diffusion to their global parameters (mean/covariance, and their derivatives) by means of an apparently novel PDE. This allows us to use methods inspired by functional data analysis, and pool information across the sparsely measured paths. The methodology we develop is fully non-parametric and avoids any functional form specification on the time-dependency of either the drift function or the diffusion function. We establish almost sure uniform asymptotic convergence rates of the proposed estimators as the number of observed curves $n$ grows to infinity. Our rates are non-asymptotic in the number of measurements per path, explicitly reflecting how different sampling frequency might affect the speed of convergence. Our framework suggests possible further fruitful interactions between FDA and SDE methods in problems with replication.
Momentum methods, including heavy-ball~(HB) and Nesterov's accelerated gradient~(NAG), are widely used in training neural networks for their fast convergence. However, there is a lack of theoretical guarantees for their convergence and acceleration since the optimization landscape of the neural network is non-convex. Nowadays, some works make progress towards understanding the convergence of momentum methods in an over-parameterized regime, where the number of the parameters exceeds that of the training instances. Nonetheless, current results mainly focus on the two-layer neural network, which are far from explaining the remarkable success of the momentum methods in training deep neural networks. Motivated by this, we investigate the convergence of NAG with constant learning rate and momentum parameter in training two architectures of deep linear networks: deep fully-connected linear neural networks and deep linear ResNets. Based on the over-parameterization regime, we first analyze the residual dynamics induced by the training trajectory of NAG for a deep fully-connected linear neural network under the random Gaussian initialization. Our results show that NAG can converge to the global minimum at a $(1 - \mathcal{O}(1/\sqrt{\kappa}))^t$ rate, where $t$ is the iteration number and $\kappa > 1$ is a constant depending on the condition number of the feature matrix. Compared to the $(1 - \mathcal{O}(1/{\kappa}))^t$ rate of GD, NAG achieves an acceleration over GD. To the best of our knowledge, this is the first theoretical guarantee for the convergence of NAG to the global minimum in training deep neural networks. Furthermore, we extend our analysis to deep linear ResNets and derive a similar convergence result.
We consider M-estimation problems, where the target value is determined using a minimizer of an expected functional of a Levy process. With discrete observations from the Levy process, we can produce a "quasi-path" by shuffling increments of the Levy process, we call it a quasi-process. Under a suitable sampling scheme, a quasi-process can converge weakly to the true process according to the properties of the stationary and independent increments. Using this resampling technique, we can estimate objective functionals similar to those estimated using the Monte Carlo simulations, and it is available as a contrast function. The M-estimator based on these quasi-processes can be consistent and asymptotically normal.
Existing inferential methods for small area data involve a trade-off between maintaining area-level frequentist coverage rates and improving inferential precision via the incorporation of indirect information. In this article, we propose a method to obtain an area-level prediction region for a future observation which mitigates this trade-off. The proposed method takes a conformal prediction approach in which the conformity measure is the posterior predictive density of a working model that incorporates indirect information. The resulting prediction region has guaranteed frequentist coverage regardless of the working model, and, if the working model assumptions are accurate, the region has minimum expected volume compared to other regions with the same coverage rate. When constructed under a normal working model, we prove such a prediction region is an interval and construct an efficient algorithm to obtain the exact interval. We illustrate the performance of our method through simulation studies and an application to EPA radon survey data.
Universal coding of integers~(UCI) is a class of variable-length code, such that the ratio of the expected codeword length to $\max\{1,H(P)\}$ is within a constant factor, where $H(P)$ is the Shannon entropy of the decreasing probability distribution $P$. However, if we consider the ratio of the expected codeword length to $H(P)$, the ratio tends to infinity by using UCI, when $H(P)$ tends to zero. To solve this issue, this paper introduces a class of codes, termed generalized universal coding of integers~(GUCI), such that the ratio of the expected codeword length to $H(P)$ is within a constant factor $K$. First, the definition of GUCI is proposed and the coding structure of GUCI is introduced. Next, we propose a class of GUCI $\mathcal{C}$ to achieve the expansion factor $K_{\mathcal{C}}=2$ and show that the optimal GUCI is in the range $1\leq K_{\mathcal{C}}^{*}\leq 2$. Then, by comparing UCI and GUCI, we show that when the entropy is very large or $P(0)$ is not large, there are also cases where the average codeword length of GUCI is shorter. Finally, the asymptotically optimal GUCI is presented.