We study the problem of testing whether an unknown set $S$ in $n$ dimensions is convex or far from convex, using membership queries. The simplest high-dimensional discrete domain where the problem of testing convexity is non-trivial is the domain $\{-1,0,1\}^n$. Our main results are nearly-tight upper and lower bounds of $3^{\widetilde \Theta( \sqrt n)}$ for one-sided error testing of convex sets over this domain with non-adaptive queries. Together with our $3^{\Omega(n)}$ lower bound on one-sided error testing with samples, this shows that non-adaptive queries are significantly more powerful than samples for this problem.
Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much theory of RL has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty - analogous to annealing schemes and curricula during training in RL - and show that the model exhibits rich behaviour, including delayed learning under sparse rewards; a variety of learning regimes depending on reward baselines; and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL.
Nested simulation concerns estimating functionals of a conditional expectation via simulation. In this paper, we propose a new method based on kernel ridge regression to exploit the smoothness of the conditional expectation as a function of the multidimensional conditioning variable. Asymptotic analysis shows that the proposed method can effectively alleviate the curse of dimensionality on the convergence rate as the simulation budget increases, provided that the conditional expectation is sufficiently smooth. The smoothness bridges the gap between the cubic root convergence rate (that is, the optimal rate for the standard nested simulation) and the square root convergence rate (that is, the canonical rate for the standard Monte Carlo simulation). We demonstrate the performance of the proposed method via numerical examples from portfolio risk management and input uncertainty quantification.
We revisit the problem of computing with noisy information considered in Feige et al. 1994, which includes computing the OR function from noisy queries, and computing the MAX, SEARCH and SORT functions from noisy pairwise comparisons. For $K$ given elements, the goal is to correctly recover the desired function with probability at least $1-\delta$ when the outcome of each query is flipped with probability $p$. We consider both the adaptive sampling setting where each query can be adaptively designed based on past outcomes, and the non-adaptive sampling setting where the query cannot depend on past outcomes. The prior work provides tight bounds on the worst-case query complexity in terms of the dependence on $K$. However, the upper and lower bounds do not match in terms of the dependence on $\delta$ and $p$. We improve the lower bounds for all the four functions under both adaptive and non-adaptive query models. Most of our lower bounds match the upper bounds up to constant factors when either $p$ or $\delta$ is bounded away from $0$, while the ratio between the best prior upper and lower bounds goes to infinity when $p\rightarrow 0$ or $p\rightarrow 1/2$. On the other hand, we also provide matching upper and lower bounds for the number of queries in expectation, improving both the upper and lower bounds for the variable-length query model.
The convexity of a set can be generalized to the two weaker notions of reach and $r$-convexity; both describe the regularity of a set's boundary. For any compact subset of $\mathbb{R}^d$, we provide methods for computing upper bounds on these quantities from point cloud data. The bounds converge to the respective quantities as the point cloud becomes dense in the set, and the rate of convergence for the bound on the reach is given under a weak regularity condition. We also introduce the $\beta$-reach, a generalization of the reach that excludes small-scale features of size less than a parameter $\beta\in[0,\infty)$. Numerical studies suggest how the $\beta$-reach can be used in high-dimension to infer the reach and other geometric properties of smooth submanifolds.
Cuckoo hashing is a powerful primitive that enables storing items using small space with efficient querying. At a high level, cuckoo hashing maps $n$ items into $b$ entries storing at most $\ell$ items such that each item is placed into one of $k$ randomly chosen entries. Additionally, there is an overflow stash that can store at most $s$ items. Many cryptographic primitives rely upon cuckoo hashing to privately embed and query data where it is integral to ensure small failure probability when constructing cuckoo hashing tables as it directly relates to the privacy guarantees. As our main result, we present a more query-efficient cuckoo hashing construction using more hash functions. For construction failure probability $\epsilon$, the query overhead of our scheme is $O(1 + \sqrt{\log(1/\epsilon)/\log n})$. Our scheme has quadratically smaller query overhead than prior works for any target failure probability $\epsilon$. We also prove lower bounds matching our construction. Our improvements come from a new understanding of the locality of cuckoo hashing failures for small sets of items. We also initiate the study of robust cuckoo hashing where the input set may be chosen with knowledge of the hash functions. We present a cuckoo hashing scheme using more hash functions with query overhead $\tilde{O}(\log \lambda)$ that is robust against poly$(\lambda)$ adversaries. Furthermore, we present lower bounds showing that this construction is tight and that extending previous approaches of large stashes or entries cannot obtain robustness except with $\Omega(n)$ query overhead. As applications of our results, we obtain improved constructions for batch codes and PIR. In particular, we present the most efficient explicit batch code and blackbox reduction from single-query PIR to batch PIR.
This paper considers an ML inspired approach to hypothesis testing known as classifier/classification-accuracy testing ($\mathsf{CAT}$). In $\mathsf{CAT}$, one first trains a classifier by feeding it labeled synthetic samples generated by the null and alternative distributions, which is then used to predict labels of the actual data samples. This method is widely used in practice when the null and alternative are only specified via simulators (as in many scientific experiments). We study goodness-of-fit, two-sample ($\mathsf{TS}$) and likelihood-free hypothesis testing ($\mathsf{LFHT}$), and show that $\mathsf{CAT}$ achieves (near-)minimax optimal sample complexity in both the dependence on the total-variation ($\mathsf{TV}$) separation $\epsilon$ and the probability of error $\delta$ in a variety of non-parametric settings, including discrete distributions, $d$-dimensional distributions with a smooth density, and the Gaussian sequence model. In particular, we close the high probability sample complexity of $\mathsf{LFHT}$ for each class. As another highlight, we recover the minimax optimal complexity of $\mathsf{TS}$ over discrete distributions, which was recently established by Diakonikolas et al. (2021). The corresponding $\mathsf{CAT}$ simply compares empirical frequencies in the first half of the data, and rejects the null when the classification accuracy on the second half is better than random.
The bootstrap is a popular data-driven method to quantify statistical uncertainty, but for modern high-dimensional problems, it could suffer from huge computational costs due to the need to repeatedly generate resamples and refit models. We study the use of bootstraps in high-dimensional environments with a small number of resamples. In particular, we show that with a recent "cheap" bootstrap perspective, using a number of resamples as small as one could attain valid coverage even when the dimension grows closely with the sample size, thus strongly supporting the implementability of the bootstrap for large-scale problems. We validate our theoretical results and compare the performance of our approach with other benchmarks via a range of experiments.
We study a mean change point testing problem for high-dimensional data, with exponentially- or polynomially-decaying tails. In each case, depending on the $\ell_0$-norm of the mean change vector, we separately consider dense and sparse regimes. We characterise the boundary between the dense and sparse regimes under the above two tail conditions for the first time in the change point literature and propose novel testing procedures that attain optimal rates in each of the four regimes up to a poly-iterated logarithmic factor. By comparing with previous results under Gaussian assumptions, our results quantify the costs of heavy-tailedness on the fundamental difficulty of change point testing problems for high-dimensional data. To be specific, when the error vectors follow sub-Weibull distributions, a CUSUM-type statistic is shown to achieve a minimax testing rate up to $\sqrt{\log\log(8n)}$. When the error distributions have polynomially-decaying tails, admitting bounded $\alpha$-th moments for some $\alpha \geq 4$, we introduce a median-of-means-type test statistic that achieves a near-optimal testing rate in both dense and sparse regimes. In particular, in the sparse regime, we further propose a computationally-efficient test to achieve the exact optimality. Surprisingly, our investigation in the even more challenging case of $2 \leq \alpha < 4$, unveils a new phenomenon that the minimax testing rate has no sparse regime, i.e.\ testing sparse changes is information-theoretically as hard as testing dense changes. This phenomenon implies a phase transition of the minimax testing rates at $\alpha = 4$.
An algorithm is said to be adaptive to a certain parameter (of the problem) if it does not need a priori knowledge of such a parameter but performs competitively to those that know it. This dissertation presents our work on adaptive algorithms in following scenarios: 1. In the stochastic optimization setting, we only receive stochastic gradients and the level of noise in evaluating them greatly affects the convergence rate. Tuning is typically required when without prior knowledge of the noise scale in order to achieve the optimal rate. Considering this, we designed and analyzed noise-adaptive algorithms that can automatically ensure (near)-optimal rates under different noise scales without knowing it. 2. In training deep neural networks, the scales of gradient magnitudes in each coordinate can scatter across a very wide range unless normalization techniques, like BatchNorm, are employed. In such situations, algorithms not addressing this problem of gradient scales can behave very poorly. To mitigate this, we formally established the advantage of scale-free algorithms that adapt to the gradient scales and presented its real benefits in empirical experiments. 3. Traditional analyses in non-convex optimization typically rely on the smoothness assumption. Yet, this condition does not capture the properties of some deep learning objective functions, including the ones involving Long Short-Term Memory networks and Transformers. Instead, they satisfy a much more relaxed condition, with potentially unbounded smoothness. Under this condition, we show that a generalized SignSGD algorithm can theoretically match the best-known convergence rates obtained by SGD with gradient clipping but does not need explicit clipping at all, and it can empirically match the performance of Adam and beat others. Moreover, it can also be made to automatically adapt to the unknown relaxed smoothness.
For modern gradient-based optimization, a developmental landmark is Nesterov's accelerated gradient descent method, which is proposed in [Nesterov, 1983], so shorten as Nesterov-1983. Afterward, one of the important progresses is its proximal generalization, named the fast iterative shrinkage-thresholding algorithm (FISTA), which is widely used in image science and engineering. However, it is unknown whether both Nesterov-1983 and FISTA converge linearly on the strongly convex function, which has been listed as the open problem in the comprehensive review [Chambolle and Pock, 2016, Appendix B]. In this paper, we answer this question by the use of the high-resolution differential equation framework. Along with the phase-space representation previously adopted, the key difference here in constructing the Lyapunov function is that the coefficient of the kinetic energy varies with the iteration. Furthermore, we point out that the linear convergence of both the two algorithms above has no dependence on the parameter $r$ on the strongly convex function. Meanwhile, it is also obtained that the proximal subgradient norm converges linearly.