We introduce the following variant of the VC-dimension. Given $S \subseteq \{0, 1\}^n$ and a positive integer $d$, we define $\mathbb{U}_d(S)$ to be the size of the largest subset $I \subseteq [n]$ such that the projection of $S$ on every subset of $I$ of size $d$ is the $d$-dimensional cube. We show that determining the largest cardinality of a set with a given $\mathbb{U}_d$ dimension is equivalent to a Tur\'an-type problem related to the total number of cliques in a $d$-uniform hypergraph. This allows us to beat the Sauer--Shelah lemma for this notion of dimension. We use this to obtain several results on $\Sigma_3^k$-circuits, i.e., depth-$3$ circuits with top gate OR and bottom fan-in at most $k$:
* A tight relationship between the number of satisfying assignments of a $2$-CNF and the dimension of the largest projection accepted by it, improving upon Paturi, Saks, and Zane (Comput. Complex. '00).
* Improved $\Sigma_3^3$-circuit lower bounds for affine dispersers for sublinear dimension. Moreover, we pose a purely hypergraph-theoretic conjecture under which we obtain a further improvement.
* Progress towards settling the $\Sigma_3^2$ complexity of the inner product function, and of all degree-$2$ polynomials over $\mathbb{F}_2$ in general. The question of determining the $\Sigma_3^3$ complexity of IP was recently posed by Golovnev, Kulikov, and Williams (ITCS'21).
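To make the definition of $\mathbb{U}_d(S)$ above concrete, here is a minimal brute-force sketch (Python, illustrative only and not taken from the paper) that computes it directly from the definition for small $n$:

```python
from itertools import combinations, product

def U_dimension(S, n, d):
    """Largest |I|, I subset of [n], such that the projection of S onto every
    d-element subset of I is the full d-dimensional cube {0,1}^d."""
    S = {tuple(x) for x in S}

    def full_projection(J):
        # projection of S onto the coordinates in J equals {0,1}^|J|
        return {tuple(x[j] for j in J) for x in S} == set(product((0, 1), repeat=len(J)))

    for size in range(n, -1, -1):
        for I in combinations(range(n), size):
            if all(full_projection(J) for J in combinations(I, d)):
                return size
    return 0

# Example: S = {000, 011, 101, 110} (even-weight vectors) has U_2(S) = 3,
# since every pair of coordinates is shattered, while its VC dimension is only 2.
S = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(U_dimension(S, n=3, d=2))  # -> 3
```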
In the standard Gaussian linear measurement model $Y=X\mu_0+\xi \in \mathbb{R}^m$ with a fixed noise level $\sigma>0$, we consider the problem of estimating the unknown signal $\mu_0$ under a convex constraint $\mu_0 \in K$, where $K$ is a closed convex set in $\mathbb{R}^n$. We show that the risk of the natural convex constrained least squares estimator (LSE) $\hat{\mu}(\sigma)$ can be characterized exactly in high-dimensional limits by the risk of the convex constrained LSE $\hat{\mu}_K^{\mathsf{seq}}$ in the corresponding Gaussian sequence model at a different noise level. The characterization holds (uniformly) for risks in the maximal regime that ranges from constant order all the way down to essentially the parametric rate, as long as a certain necessary non-degeneracy condition is satisfied for $\hat{\mu}(\sigma)$. The precise risk characterization reveals a fundamental difference between noiseless (or low-noise limit) and noisy linear inverse problems in terms of the sample complexity for signal recovery. A concrete example is given by the isotonic regression problem: while exact recovery of a general monotone signal requires $m\gg n^{1/3}$ samples in the noiseless setting, consistent signal recovery in the noisy setting requires as few as $m\gg \log n$ samples. Such a discrepancy occurs when the low- and high-noise risk behaviors of $\hat{\mu}_K^{\mathsf{seq}}$ differ significantly. In statistical language, this occurs when $\hat{\mu}_K^{\mathsf{seq}}$ estimates $0$ at a faster `adaptation rate' than the slower `worst-case rate' for general signals. Several other examples, including non-negative least squares and the generalized Lasso (in constrained form), are also worked out to demonstrate the concrete applicability of the theory in problems of different types.
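To fix ideas (an illustrative sketch, not the paper's calibration), the two estimators compared above can be computed directly for the non-negative-orthant constraint $K=\mathbb{R}^n_{\geq 0}$ mentioned among the examples: the constrained LSE $\hat{\mu}(\sigma)$ in the linear measurement model via `scipy.optimize.nnls`, and the sequence-model LSE $\hat{\mu}_K^{\mathsf{seq}}$, which for this $K$ is coordinatewise positive-part thresholding. The sequence-model noise level below is a placeholder, not the matching value from the theory.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n, m, sigma = 200, 120, 1.0            # hypothetical sizes / noise level

# Sparse non-negative signal (illustrative choice, not from the paper).
mu0 = np.zeros(n)
mu0[:10] = 2.0

# Linear measurement model: Y = X mu0 + xi, with i.i.d. N(0,1) entries of X.
X = rng.standard_normal((m, n))
Y = X @ mu0 + sigma * rng.standard_normal(m)

# Constrained LSE over K = nonnegative orthant: argmin_{mu >= 0} ||Y - X mu||^2.
mu_hat, _ = nnls(X, Y)

# Sequence model at some noise level tau: observe y = mu0 + tau * g and project
# onto K, which for the orthant is coordinatewise max(., 0).
tau = 0.2                              # placeholder; the paper's calibration differs
y_seq = mu0 + tau * rng.standard_normal(n)
mu_hat_seq = np.maximum(y_seq, 0.0)

print("linear-model LSE risk  :", np.sum((mu_hat - mu0) ** 2))
print("sequence-model LSE risk:", np.sum((mu_hat_seq - mu0) ** 2))
```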
We establish several Schur-convexity type results under fixed variance for weighted sums of independent gamma random variables and obtain nonasymptotic bounds on their R\'enyi entropies. In particular, our results relate to recent work of Bartczak, Nayar, and Zwara as well as of Bobkov, Naumov, and Ulyanov, offering simple proofs of the former and extending the latter.
Empirical observations of high-dimensional phenomena, such as the double descent behaviour, have attracted a lot of interest in understanding classical techniques such as kernel methods, and their implications for explaining the generalization properties of neural networks. Many recent works analyze such models in a certain high-dimensional regime where the covariates are independent and the number of samples and the number of covariates grow at a fixed ratio (i.e., proportional asymptotics). In this work we show that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime. More surprisingly, when the data is generated by a kernel model in which the relationship between the input and the response can be highly nonlinear, we show that linear models are in fact optimal, i.e., linear models achieve the minimum risk among all models, linear or nonlinear. These results suggest that more complex data models than independent features are needed for high-dimensional analysis.
This paper studies inference for the regression coefficient matrix in multivariate response linear regression in the presence of hidden variables. A novel procedure for constructing confidence intervals for entries of the coefficient matrix is proposed. Our method first exploits the multivariate nature of the responses by estimating and adjusting for the hidden effect to construct an initial estimator of the coefficient matrix. By further deploying a low-dimensional projection procedure to reduce the bias introduced by the regularization in the previous step, a refined estimator is obtained and shown to be asymptotically normal. The asymptotic variance of the resulting estimator is derived in closed form and can be consistently estimated. In addition, we propose a testing procedure for the existence of hidden effects and provide its theoretical justification. Both our procedures and their analyses are valid even when the feature dimension and the number of responses exceed the sample size. Our results are further backed up by extensive simulations and a real data analysis.
For two graphs $G^<$ and $H^<$ with linearly ordered vertex sets, the \emph{ordered Ramsey number} $r_<(G^<,H^<)$ is the minimum $N$ such that every red-blue coloring of the edges of the ordered complete graph on $N$ vertices contains a red copy of $G^<$ or a blue copy of $H^<$. For a positive integer $n$, a \emph{nested matching} $NM^<_n$ is the ordered graph on $2n$ vertices with edges $\{i,2n-i+1\}$ for every $i=1,\dots,n$. We improve bounds on the ordered Ramsey numbers $r_<(NM^<_n,K^<_3)$ obtained by Rohatgi, we disprove his conjecture by showing $4n+1 \leq r_<(NM^<_n,K^<_3) \leq (3+\sqrt{5})n$ for every $n \geq 6$, and we determine the numbers $r_<(NM^<_n,K^<_3)$ exactly for $n=4,5$. As a corollary, this gives stronger lower bounds on the maximum chromatic number of $k$-queue graphs for every $k \geq 3$. We also prove $r_<(NM^<_m,K^<_n)=\Theta(mn)$ for arbitrary $m$ and $n$. We expand the classical notion of Ramsey goodness to the ordered case and we attempt to characterize all connected ordered graphs that are $n$-good for every $n\in\mathbb{N}$. In particular, we discover a new class of ordered trees that are $n$-good for every $n \in \mathbb{N}$, extending all the previously known examples.
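As a concrete illustration of the nested matching defined above, here is a small sketch (Python, not part of the paper) that lists the edges of $NM^<_n$ and tests whether an ordered graph contains an ordered copy of $NM^<_n$, i.e. $n$ pairwise strictly nested edges:

```python
from functools import lru_cache

def nested_matching(n):
    """Edge set {i, 2n-i+1}, i = 1..n, of the nested matching NM_n on vertices 1..2n."""
    return [(i, 2 * n - i + 1) for i in range(1, n + 1)]

def contains_nested_matching(edges, n):
    """True iff the ordered graph with the given edges contains an ordered copy of
    NM_n, i.e. a chain of n pairwise strictly nested edges."""
    edges = [tuple(sorted(e)) for e in edges]

    @lru_cache(maxsize=None)
    def chain(k):
        a, b = edges[k]
        inner = [chain(l) for l, (c, d) in enumerate(edges) if a < c and d < b]
        return 1 + max(inner, default=0)

    return any(chain(k) >= n for k in range(len(edges)))

print(nested_matching(3))                                             # [(1, 6), (2, 5), (3, 4)]
print(contains_nested_matching([(1, 8), (2, 5), (3, 4), (6, 7)], 3))  # True: (1,8), (2,5), (3,4) are nested
```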
We investigate special structures due to automorphisms in isogeny graphs of principally polarized abelian varieties, and abelian surfaces in particular. We give theoretical and experimental results on the spectral and statistical properties of (2,2)-isogeny graphs of superspecial abelian surfaces, including stationary distributions for random walks, bounds on eigenvalues and diameters, and a proof of the connectivity of the Jacobian subgraph of the (2,2)-isogeny graph. These results improve our understanding of the performance and security of some recently proposed cryptosystems, and are also a concrete step towards a better understanding of general superspecial isogeny graphs in arbitrary dimension.
In the theory of discrete-time linear switching systems, as in other areas of mathematics, one faces the problem of studying the growth rate of the norms of all possible matrix products $A_{\sigma_{n}}\cdots A_{\sigma_{0}}$ with factors from a set of matrices $\mathscr{A}$. So far, the sequences of matrices attaining the maximal rate of norm growth have been accurately described only for a relatively small number of classes of matrices $\mathscr{A}$. Moreover, in almost all theoretically studied cases, the index sequences $\{\sigma_{n}\}$ of matrices maximizing the norms of the corresponding matrix products turn out to be periodic or so-called Sturmian, which entails a whole range of "good" properties of the sequences $\{A_{\sigma_{n}}\}$, in particular the existence of a limiting frequency of occurrence of each matrix factor $A_{i}\in\mathscr{A}$ in them. In this paper it is shown that this is not always the case: we define a set $\mathscr{A}$ consisting of two $2\times 2$ matrices, similar to rotations of the plane, for which the sequence $\{A_{\sigma_{n}}\}$ maximizing the growth rate of the norms $\|A_{\sigma_{n}}\cdots A_{\sigma_{0}}\|$ is not Sturmian. All considerations are based on numerical modeling and cannot be regarded as mathematically rigorous in this part; rather, they should be interpreted as a collection of questions for further comprehensive theoretical study.
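In the spirit of the numerical experiments described above, here is a minimal sketch (Python/NumPy; the two matrices are illustrative placeholders, not the pair constructed in the paper) that exhaustively searches over index sequences of a fixed length and records which sequence maximizes the norm of the product, together with the frequency of each factor:

```python
import itertools
import numpy as np

def rotation_like(theta, scale):
    """A 2x2 matrix that is a rescaled rotation of the plane by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return scale * np.array([[c, -s], [s, c]])

# Illustrative pair of matrices (placeholders, not the class from the paper).
A = [rotation_like(0.9, 1.0), rotation_like(-0.4, 1.02)]

def best_sequence(T):
    """Index sequence sigma_0..sigma_{T-1} maximizing ||A_{sigma_{T-1}} ... A_{sigma_0}||."""
    best_norm, best_sigma = -np.inf, None
    for sigma in itertools.product((0, 1), repeat=T):
        P = np.eye(2)
        for i in sigma:
            P = A[i] @ P
        norm = np.linalg.norm(P, 2)        # spectral norm of the product
        if norm > best_norm:
            best_norm, best_sigma = norm, sigma
    return best_sigma, best_norm

for T in (8, 12, 16):
    sigma, nrm = best_sequence(T)
    print(T, nrm, "frequency of matrix 1:", sum(sigma) / T)
```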
We determine the exact minimax rate of a Gaussian sequence model under bounded convex constraints, purely in terms of the local geometry of the given constraint set $K$. Our main result shows that the minimax risk (up to constant factors) under the squared $L_2$ loss is given by $\epsilon^{*2} \wedge \operatorname{diam}(K)^2$ with \begin{align*} \epsilon^* = \sup \bigg\{\epsilon : \frac{\epsilon^2}{\sigma^2} \leq \log M^{\operatorname{loc}}(\epsilon)\bigg\}, \end{align*} where $\log M^{\operatorname{loc}}(\epsilon)$ denotes the local entropy of the set $K$, and $\sigma^2$ is the variance of the noise. We utilize our abstract result to re-derive known minimax rates for some special sets $K$ such as hyperrectangles, ellipses, and more generally quadratically convex orthosymmetric sets. Finally, we extend our results to the unbounded case with known $\sigma^2$ to show that the minimax rate in that case is $\epsilon^{*2}$.
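Purely as an illustration of the fixed-point quantity above (not taken from the paper), the sketch below approximates $\epsilon^*$ on a grid for a hypothetical local entropy of polynomial type, $\log M^{\operatorname{loc}}(\epsilon) = (R/\epsilon)^{\alpha}$:

```python
import numpy as np

def epsilon_star(log_M_loc, sigma, eps_grid):
    """sup{ eps : eps^2 / sigma^2 <= log M_loc(eps) }, approximated on a grid."""
    feasible = [eps for eps in eps_grid if eps ** 2 / sigma ** 2 <= log_M_loc(eps)]
    return max(feasible) if feasible else 0.0

# Hypothetical local entropy of polynomial type: log M_loc(eps) = (R / eps)**alpha.
R, alpha, sigma = 1.0, 1.0, 0.05
log_M_loc = lambda eps: (R / eps) ** alpha

grid = np.geomspace(1e-4, 10.0, 10_000)
eps_star = epsilon_star(log_M_loc, sigma, grid)
diam_sq = 4 * R ** 2                   # placeholder value for diam(K)^2
print("eps*^2 ^ diam(K)^2 approx:", min(eps_star ** 2, diam_sq))
# For alpha = 1, eps* solves eps^2 / sigma^2 = R / eps, i.e. eps* = (R * sigma^2)**(1/3) ~ 0.136.
```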
Graph Neural Networks (GNN) come in many flavors, but should always be either invariant (permutation of the nodes of the input graph does not affect the output) or equivariant (permutation of the input permutes the output). In this paper, we consider a specific class of invariant and equivariant networks, for which we prove new universality theorems. More precisely, we consider networks with a single hidden layer, obtained by summing channels formed by applying an equivariant linear operator, a pointwise non-linearity, and either an invariant or an equivariant linear operator. Recently, Maron et al. (2019) showed that by allowing higher-order tensorization inside the network, universal invariant GNNs can be obtained. As a first contribution, we propose an alternative proof of this result, which relies on the Stone--Weierstrass theorem for algebras of real-valued functions. Our main contribution is then an extension of this result to the equivariant case, which appears in many practical applications but has been less studied from a theoretical point of view. The proof relies on a new generalized Stone--Weierstrass theorem for algebras of equivariant functions, which is of independent interest. Finally, unlike many previous settings that consider a fixed number of nodes, our results show that a GNN defined by a single set of parameters can approximate uniformly well a function defined on graphs of varying size.
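The single-hidden-layer architecture described above can be sketched concretely (Python/NumPy, with hypothetical dimensions and only a small subset of the permutation-equivariant linear basis; this is an illustration, not the paper's construction): an equivariant linear operator acting on the adjacency matrix, a pointwise non-linearity, and an invariant linear readout, so that relabeling the nodes leaves the output unchanged.

```python
import numpy as np

def equivariant_linear(A, w):
    """A few permutation-equivariant linear operations on an n x n tensor,
    combined with weights w (a small subset of the full equivariant basis)."""
    n = A.shape[0]
    ones = np.ones((n, n))
    ops = [
        A,                                                    # identity
        A.T,                                                  # transpose
        np.tile(A.sum(axis=1, keepdims=True), (1, n)) / n,    # broadcast row sums
        np.tile(A.sum(axis=0, keepdims=True), (n, 1)) / n,    # broadcast column sums
        ones * A.sum() / n ** 2,                              # broadcast total sum
    ]
    return sum(wi * op for wi, op in zip(w, ops))

def invariant_readout(H):
    """Permutation-invariant linear readout: average over all entries."""
    return H.mean()

def invariant_gnn(A, params):
    """One hidden layer: sum of channels (equivariant linear -> ReLU -> invariant linear)."""
    out = 0.0
    for w_ch, a_ch in params:                                 # (equivariant weights, output scale)
        H = np.maximum(equivariant_linear(A, w_ch), 0.0)      # pointwise non-linearity
        out += a_ch * invariant_readout(H)
    return out

rng = np.random.default_rng(0)
n = 6
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1)
A = A + A.T                                                   # random undirected graph
params = [(rng.standard_normal(5), rng.standard_normal()) for _ in range(4)]

perm = rng.permutation(n)
print(invariant_gnn(A, params), invariant_gnn(A[np.ix_(perm, perm)], params))  # equal up to rounding
```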
In this paper, we study the optimal convergence rates for distributed convex optimization problems over networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely when the function $F(\mathbf{x}) \triangleq \sum_{i=1}^{m}f_i(\mathbf{x})$ is strongly convex and smooth, only strongly convex, only smooth, or just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and achieves the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors), with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions of the proposed setup, such as proximal-friendly functions, time-varying graphs, and improvement of the condition numbers.
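As a toy illustration of the dual-based approach described above (a minimal sketch in Python/NumPy with hypothetical quadratic local functions $f_i(x) = \tfrac12 (x-b_i)^2$ and a path-graph network; this is not the paper's general algorithm): the consensus requirement is encoded as the affine constraint $Wx=0$ with $W$ the graph Laplacian, and Nesterov's accelerated gradient is run on the dual, where each gradient evaluation only requires a multiplication by $W$, i.e. one round of communication with neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20                                    # number of nodes / local functions

# Path-graph Laplacian W; Wx = 0 <=> consensus (all coordinates equal) for a connected graph.
W = np.zeros((m, m))
for i in range(m - 1):
    W[i, i] += 1; W[i + 1, i + 1] += 1
    W[i, i + 1] -= 1; W[i + 1, i] -= 1

b = rng.standard_normal(m)                # local data: f_i(x) = 0.5 * (x - b_i)^2
x_star = b.mean()                         # centralized solution of min_x sum_i f_i(x)

# Dual of  min_x sum_i f_i(x_i)  s.t.  Wx = 0.  For quadratic f_i the Lagrangian
# minimizer is x(lam) = b - W lam, and the dual gradient is W x(lam).
eig = np.linalg.eigvalsh(W)
L = eig[-1] ** 2                          # Lipschitz constant of the dual gradient
mu = eig[1] ** 2                          # strong concavity constant on range(W) (connected graph)
beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)

lam = np.zeros(m)
lam_prev = lam.copy()
for _ in range(5000):                     # Nesterov's accelerated gradient on the dual
    y = lam + beta * (lam - lam_prev)
    x = b - W @ y                         # local primal updates
    grad = W @ x                          # consensus residual: one neighbor-communication round
    lam_prev, lam = lam, y + grad / L     # ascent step (the dual is concave)

x = b - W @ lam
print("max deviation from the centralized solution:", np.abs(x - x_star).max())
```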