We study the universality of complex-valued neural networks with bounded widths and arbitrary depths. Under mild assumptions, we give a full description of those activation functions $\varrho:\mathbb{C}\to \mathbb{C}$ that have the property that their associated networks are universal, i.e., are capable of approximating continuous functions to arbitrary accuracy on compact domains. Precisely, we show that deep narrow complex-valued networks are universal if and only if their activation function is neither holomorphic, nor antiholomorphic, nor $\mathbb{R}$-affine. This is a much larger class of functions than in the dual setting of arbitrary width and fixed depth. Unlike in the real case, the sufficient width differs significantly depending on the considered activation function. We show that a width of $2n+2m+5$ is always sufficient and that in general a width of $\max\{2n,2m\}$ is necessary. We prove, however, that a width of $n+m+4$ suffices for a rich subclass of the admissible activation functions. Here, $n$ and $m$ denote the input and output dimensions of the considered networks.
Despite widespread adoption in practice, guarantees for the LASSO and Group LASSO are strikingly lacking in settings beyond statistical problems, and these algorithms are usually considered to be a heuristic in the context of sparse convex optimization on deterministic inputs. We give the first recovery guarantees for the Group LASSO for sparse convex optimization with vector-valued features. We show that if a sufficiently large Group LASSO regularization is applied when minimizing a strictly convex function $l$, then the minimizer is a sparse vector supported on vector-valued features with the largest $\ell_2$ norm of the gradient. Thus, repeating this procedure selects the same set of features as the Orthogonal Matching Pursuit algorithm, which admits recovery guarantees for any function $l$ with restricted strong convexity and smoothness via weak submodularity arguments. This answers open questions of Tibshirani et al. and Yasuda et al. Our result is the first to theoretically explain the empirical success of the Group LASSO for convex functions under general input instances assuming only restricted strong convexity and smoothness. Our result also generalizes provable guarantees for the Sequential Attention algorithm, which is a feature selection algorithm inspired by the attention mechanism proposed by Yasuda et al. As an application of our result, we give new results for the column subset selection problem, which is well-studied when the loss is the Frobenius norm or other entrywise matrix losses. We give the first result for general loss functions for this problem that requires only restricted strong convexity and smoothness.
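The selection rule described above (keep the vector-valued feature whose gradient block has the largest $\ell_2$ norm) can be illustrated with a minimal sketch. It covers only a single gradient-based selection step, not a Group LASSO solver, and it assumes a least-squares loss $l(\beta)=\tfrac12\|X\beta-y\|_2^2$ evaluated at $\beta=0$; the function names `gradient_at_zero` and `select_group` are hypothetical.

```python
import math

def gradient_at_zero(X, y):
    """Gradient of 0.5 * ||X b - y||^2 at b = 0, which equals -X^T y.

    The least-squares loss is an assumption made for illustration only.
    """
    n_rows, n_cols = len(X), len(X[0])
    return [-sum(X[i][j] * y[i] for i in range(n_rows)) for j in range(n_cols)]

def select_group(grad, groups):
    """Return the index of the group whose gradient block has the largest l2 norm."""
    def block_norm(indices):
        return math.sqrt(sum(grad[j] ** 2 for j in indices))
    return max(range(len(groups)), key=lambda g: block_norm(groups[g]))

# Toy data: four scalar coordinates arranged into two vector-valued features.
X = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
y = [3.0, 4.0, 1.0]
groups = [[0, 1], [2, 3]]

grad = gradient_at_zero(X, y)
chosen = select_group(grad, groups)  # group 0: its block norm is 5, versus 1 for group 1
```

Repeating this step after refitting on the selected groups gives an Orthogonal Matching Pursuit style loop, which is the procedure the abstract compares against.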
In this paper, we study the following problem. Consider a setting where a proposal is offered to the vertices of a given network $G$, and the vertices must conduct a vote and decide whether to accept the proposal or reject it. Each vertex $v$ has its own valuation of the proposal; we say that $v$ is ``happy'' if its valuation is positive (i.e., it expects to gain from adopting the proposal) and ``sad'' if its valuation is negative. However, vertices do not base their vote merely on their own valuation. Rather, a vertex $v$ is a \emph{proponent} of the proposal if a majority of its neighbors are happy with it and an \emph{opponent} in the opposite case. At the end of the vote, the network collectively accepts the proposal whenever a majority of its vertices are proponents. We study this problem on regular graphs with loops. Specifically, we consider the class ${\mathcal G}_{n|d|h}$ of $d$-regular graphs of odd order $n$ with all $n$ loops and $h$ happy vertices. We are interested in establishing necessary and sufficient conditions for the class ${\mathcal G}_{n|d|h}$ to contain a labeled graph accepting the proposal, as well as conditions to contain a graph rejecting the proposal. We also discuss connections to the existing literature, including that on majority domination, and investigate the properties of the obtained conditions.
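The voting rule above can be made concrete with a small sketch. It assumes that each neighborhood includes the vertex itself (because of the loop) and that a tie counts as opposition; the function name `network_accepts` and the 5-cycle example are illustrative choices, not taken from the paper.

```python
def network_accepts(neighborhoods, happy):
    """Decide whether the network accepts the proposal.

    neighborhoods: dict mapping each vertex to the set of its neighbors,
                   including the vertex itself (the loop).
    happy:         set of vertices with a positive valuation.
    A vertex is a proponent if a strict majority of its neighbors are happy;
    treating ties as opposition is an assumption of this sketch.
    """
    proponents = 0
    for vertex, nbrs in neighborhoods.items():
        happy_nbrs = sum(1 for u in nbrs if u in happy)
        if 2 * happy_nbrs > len(nbrs):
            proponents += 1
    # The proposal passes when a strict majority of vertices are proponents.
    return 2 * proponents > len(neighborhoods)

# A cycle on n = 5 vertices with a loop at every vertex.
n = 5
cycle = {v: {(v - 1) % n, v, (v + 1) % n} for v in range(n)}

accepted = network_accepts(cycle, happy={0, 1, 2})   # three proponents: 0, 1, 2
rejected = network_accepts(cycle, happy={0, 1})      # only two proponents: 0, 1
```

Enumerating all placements of $h$ happy vertices with such a function is one way to explore small cases by hand before looking for general conditions.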
Analysis of high-dimensional data, where the number of covariates is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level $\tau^2$ and noise level $\sigma^2$, i.e., to quantify how much variation in the response variable can be explained by the covariates, versus how much of the variation is left unexplained. This thesis considers the estimation of these quantities in a semi-supervised setting, where for many observations only the vector of covariates $X$ is given with no responses $Y$. Our main research question is: how can one use the unlabeled data to better estimate $\tau^2$ and $\sigma^2$? We consider two frameworks: a linear regression model and a linear projection model in which linearity is not assumed. In the first framework, while linear regression is used, no sparsity assumptions on the coefficients are made. In the second framework, the linearity assumption is also relaxed and we aim to estimate the signal and noise levels defined by the linear projection. We first propose a naive estimator which is unbiased and consistent, under some assumptions, in both frameworks. We then show how the naive estimator can be improved by using zero-estimators, where a zero-estimator is a statistic arising from the unlabeled data, whose expected value is zero. In the first framework, we calculate the optimal zero-estimator improvement and discuss ways to approximate the optimal improvement. In the second framework, such optimality no longer holds and we suggest two zero-estimators that improve the naive estimator, although not necessarily optimally. Furthermore, we show that our approach reduces the variance for general initial estimators and we present an algorithm that potentially improves any initial estimator. Lastly, we consider four datasets and study the performance of our suggested methods.
We study the convergence of message passing graph neural networks on random graph models to their continuous counterpart as the number of nodes tends to infinity. Until now, this convergence was only known for architectures with aggregation functions in the form of normalized means, or, equivalently, of an application of classical operators like the adjacency matrix or the graph Laplacian. We extend such results to a large class of aggregation functions, that encompasses all classically used message passing graph neural networks, such as attention-based message passing, max convolutional message passing or (degree-normalized) convolutional message passing. Under mild assumptions, we give non-asymptotic bounds with high probability to quantify this convergence. Our main result is based on the McDiarmid inequality. Interestingly, this result does not apply to the case where the aggregation is a coordinate-wise maximum. We treat this case separately and obtain a different convergence rate.
In this work, we propose a novel approach called Operational Support Estimator Networks (OSENs) for the support estimation task. Support Estimation (SE) is defined as finding the locations of non-zero elements in a sparse signal. By its very nature, the mapping between the measurement and sparse signal is a non-linear operation. Traditional support estimators rely on computationally expensive iterative signal recovery techniques to achieve such non-linearity. Contrary to the convolution layers, the proposed OSEN approach consists of operational layers that can learn such complex non-linearities without the need for deep networks. In this way, the performance of the non-iterative support estimation is greatly improved. Moreover, the operational layers comprise so-called generative \textit{super neurons} with non-local kernels. The kernel location for each neuron/feature map is optimized jointly for the SE task during the training. We evaluate the OSENs in three different applications: i. support estimation from Compressive Sensing (CS) measurements, ii. representation-based classification, and iii. learning-aided CS reconstruction where the output of OSENs is used as prior knowledge to the CS algorithm for an enhanced reconstruction. Experimental results show that the proposed approach achieves computational efficiency and outperforms competing methods, especially at low measurement rates by a significant margin. The software implementation is publicly shared at https://github.com/meteahishali/OSEN.
We consider the estimation of a factor model-based variance-covariance matrix when the factor loading matrix is assumed sparse. To do so, we rely on a system of penalized estimating functions to account for the identification issue of the factor loading matrix while fostering sparsity in potentially all its entries. We prove the oracle property of the penalized estimator for the factor model when the dimension is fixed. That is, the penalization procedure can recover the true sparse support, and the estimator is asymptotically normally distributed. Consistency and recovery of the true zero entries are established when the number of parameters is diverging. These theoretical results are supported by simulation experiments, and the relevance of the proposed method is illustrated by an application to portfolio allocation.
Twin-width is a width parameter introduced by Bonnet, Kim, Thomass\'e and Watrigant [FOCS'20, JACM'22], which has many structural and algorithmic applications. We prove that the twin-width of every graph embeddable in a surface of Euler genus $g$ is $18\sqrt{47g}+O(1)$, which is asymptotically best possible as it asymptotically differs from the lower bound by a constant multiplicative factor. Our proof also yields a quadratic time algorithm to find a corresponding contraction sequence. To prove the upper bound on twin-width of graphs embeddable in surfaces, we provide a stronger version of the Product Structure Theorem for graphs of Euler genus $g$ that asserts that every such graph is a subgraph of the strong product of a path and a graph with a tree-decomposition with all bags of size at most eight with a single exceptional bag of size $\max\{8,32g-27\}$.
This work addresses the block-diagonal semidefinite program (SDP) relaxations for the clique number of the Paley graphs. The size of the maximal clique (clique number) of a graph is a classic NP-complete problem; a Paley graph is a deterministic graph where two vertices are connected if their difference is a quadratic residue modulo certain prime powers. Improving the upper bound for the Paley graph clique number for odd prime powers is an open problem in combinatorics. Moreover, since quadratic residues exhibit pseudorandom properties, Paley graphs are related to the construction of deterministic restricted isometries, an open problem in compressed sensing and sparse recovery. Recent work provides numerical evidence that the current upper bounds can be improved by the sum-of-squares (SOS) relaxations. In particular, the bounds given by the SOS relaxations of degree 4 (SOS-4) have been empirically observed to be growing at an order smaller than the square root of the prime. However, computations of SOS-4 appear to be intractable with respect to large graphs. Gvozdenovic et al. introduced a more computationally efficient block-diagonal hierarchy of SDPs that refines the SOS hierarchy. They computed the values of these SDPs of degrees 2 and 3 (L2 and L3 respectively) for the Paley graph clique numbers associated with primes p less than or equal to 809. These values bound from above the values of the corresponding SOS-4 and SOS-6 relaxations respectively. We revisit these computations and compute the values of the L2 relaxations for larger p's. Our results provide additional numerical evidence that the L2 relaxations, and therefore also the SOS-4 relaxations, are asymptotically growing at an order smaller than the square root of p.
Normalization is known to help the optimization of deep neural networks. Curiously, different architectures require specialized normalization methods. In this paper, we study what normalization is effective for Graph Neural Networks (GNNs). First, we adapt and evaluate the existing methods from other domains to GNNs. Faster convergence is achieved with InstanceNorm compared to BatchNorm and LayerNorm. We provide an explanation by showing that InstanceNorm serves as a preconditioner for GNNs, but such preconditioning effect is weaker with BatchNorm due to the heavy batch noise in graph datasets. Second, we show that the shift operation in InstanceNorm results in an expressiveness degradation of GNNs for highly regular graphs. We address this issue by proposing GraphNorm with a learnable shift. Empirically, GNNs with GraphNorm converge faster compared to GNNs using other normalization. GraphNorm also improves the generalization of GNNs, achieving better performance on graph classification benchmarks.
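The effect of the shift on highly regular graphs can be seen in a minimal single-feature sketch. It assumes the normalization takes the form $\gamma\,(h_i-\alpha\mu)/\hat\sigma+\beta$, where $\mu$ is the mean over the nodes of one graph, $\hat\sigma$ is the standard deviation of the shifted values, and $\alpha$ is the learnable shift; the function name `graph_norm` and the default parameter values are hypothetical. Setting $\alpha=1$ gives an InstanceNorm-style full mean subtraction.

```python
import math

def graph_norm(values, alpha=1.0, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature channel over the nodes of a single graph.

    alpha scales how much of the mean is subtracted (alpha = 1 removes all
    of it). The exact formula is an assumption of this sketch.
    """
    mean = sum(values) / len(values)
    shifted = [v - alpha * mean for v in values]
    variance = sum(s * s for s in shifted) / len(values)
    scale = math.sqrt(variance + eps)
    return [gamma * s / scale + beta for s in shifted]

# On a regular graph, aggregated features can be identical across nodes.
constant_features = [2.0, 2.0, 2.0]

full_shift = graph_norm(constant_features, alpha=1.0)     # every entry collapses to 0.0
partial_shift = graph_norm(constant_features, alpha=0.5)  # the magnitude information survives
```

With the full shift, two regular graphs whose node features differ only by a constant become indistinguishable after normalization, which is the degradation a learnable $\alpha$ is meant to avoid.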
Graph Convolutional Networks (GCNs) and their variants have experienced significant attention and have become the de facto methods for learning graph representations. GCNs derive inspiration primarily from recent deep learning approaches, and as a result, may inherit unnecessary complexity and redundant computation. In this paper, we reduce this excess complexity through successively removing nonlinearities and collapsing weight matrices between consecutive layers. We theoretically analyze the resulting linear model and show that it corresponds to a fixed low-pass filter followed by a linear classifier. Notably, our experimental evaluation demonstrates that these simplifications do not negatively impact accuracy in many downstream applications. Moreover, the resulting model scales to larger datasets, is naturally interpretable, and yields up to two orders of magnitude speedup over FastGCN.
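Once the nonlinearities are removed and the weight matrices collapsed, feature propagation reduces to repeated multiplication by a fixed matrix, followed by a linear classifier. The sketch below shows only the propagation step and assumes the usual GCN normalization $S=\tilde D^{-1/2}(A+I)\tilde D^{-1/2}$, which the abstract does not state explicitly; the helper names are hypothetical.

```python
import math

def normalized_adjacency(adj):
    """Return S = D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix (list of lists)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def propagate(adj, features, k):
    """Apply the fixed filter k times: S^k X. No weights and no nonlinearity are involved."""
    s = normalized_adjacency(adj)
    out = features
    for _ in range(k):
        out = matmul(s, out)
    return out

# Two nodes joined by one edge, with a single scalar feature each.
adj = [[0.0, 1.0],
       [1.0, 0.0]]
features = [[1.0], [3.0]]

smoothed = propagate(adj, features, k=2)  # both rows become [2.0], the neighborhood average
```

Because `propagate` has no trainable parameters, its output can be computed once as preprocessing, and only the final linear classifier is trained on `smoothed`.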