In this paper, we provide a geometric interpretation of the structure of Deep Learning (DL) networks, characterized by $L$ hidden layers, a ReLU (ramp) activation function, an $\mathcal{L}^2$ Schatten class (or Hilbert-Schmidt) cost function, and input and output spaces $\mathbb{R}^Q$ of equal dimension $Q\geq1$. The hidden layers are also defined on $\mathbb{R}^{Q}$, and the training input size $N$ can be arbitrarily large; thus, we consider the underparametrized regime. We apply our recent results on shallow neural networks to construct an explicit family of minimizers for the global minimum of the cost function in the case $L\geq Q$, which we show to be degenerate. In the context presented here, the hidden layers of the DL network "curate" the training inputs by recursive application of a truncation map that minimizes the noise-to-signal ratio of the training inputs. Moreover, we determine a set of $2^Q-1$ distinct degenerate local minima of the cost function. Our constructions make no use of gradient descent algorithms.
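To make the setting concrete, here is a minimal sketch (an illustration only, not the paper's construction) of a ReLU network with $L$ hidden layers acting on $\mathbb{R}^Q$ and of the $\mathcal{L}^2$ (Hilbert-Schmidt, i.e. Frobenius) cost evaluated over $N$ training pairs; all function names, shapes, and the output layer are illustrative assumptions.

```python
import numpy as np

def relu(x):
    # ReLU (ramp) activation, applied componentwise.
    return np.maximum(x, 0.0)

def forward(x, layers):
    # Recursively apply the L hidden layers, each an affine map on R^Q
    # followed by ReLU; the paper interprets such recursive truncations
    # as "curating" the training inputs.
    for W, b in layers:          # W is Q x Q, b is Q x 1
        x = relu(W @ x + b)
    return x

def hilbert_schmidt_cost(X, Y, layers, W_out, b_out):
    # L^2 (Hilbert-Schmidt / Frobenius) cost over the N training pairs;
    # columns of X are inputs and columns of Y are target outputs in R^Q.
    pred = W_out @ forward(X, layers) + b_out
    return 0.5 * np.linalg.norm(pred - Y, "fro") ** 2
```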
The interest in network analysis of bibliographic data has grown substantially in recent years, yet comprehensive statistical models for examining the complete dynamics of scientific networks based on bibliographic data are generally lacking. Current empirical studies often restrict the analysis either to paper citation networks (paper-by-paper) or to author networks (author-by-author). However, such networks encompass not only direct connections between papers but also indirect relationships between the references of papers connected by a citation link. In this paper, we extend recently developed relational hyperevent models (RHEM) for analyzing scientific networks. We introduce new covariates that represent theoretically meaningful and empirically interesting sub-network configurations. The model accommodates testing hypotheses that consider (i) the polyadic nature of scientific publication events and (ii) the interdependencies between the authors and references of current and prior papers. We implement the model using purpose-built, publicly available open-source software, and demonstrate its empirical value in an analysis of a large publicly available scientific network dataset. Assessing the relative strength of the various effects reveals that both the hyperedge structure of publication events and the interconnections between authors and references significantly improve our understanding and interpretation of collaborative scientific production.
It has been widely observed that neural networks are vulnerable to small additive perturbations of the input that cause misclassification. In this paper, we focus on $\ell_0$-bounded adversarial attacks and aim to theoretically characterize the performance of adversarial training for an important class of truncated classifiers. Such classifiers have been shown to perform strongly in the $\ell_0$-adversarial setting, both empirically and theoretically under the Gaussian mixture model. The main contribution of this paper is a novel, distribution-independent generalization bound for binary classification under $\ell_0$-bounded adversarial perturbations. Deriving a generalization bound in this setting poses two main challenges: (i) the truncated inner product, which is highly non-linear; and (ii) the maximization over the $\ell_0$ ball induced by adversarial training, which is non-convex and highly non-smooth. To tackle these challenges, we develop new coding techniques for bounding the combinatorial dimension of the truncated hypothesis class.
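As a rough illustration of the objects involved (not the paper's exact definitions), the sketch below implements one plausible form of truncated linear classifier, in which the $k$ coordinate-wise products of largest magnitude are discarded before summation so that an adversary who can alter at most $k$ coordinates has limited influence; the specific truncation rule and all names are assumptions.

```python
import numpy as np

def truncated_inner_product(w, x, k):
    # One plausible truncation: drop the k coordinate-wise products with
    # largest magnitude before summing, limiting the influence that any k
    # coordinates (e.g. those corrupted by an l0-bounded adversary) can have.
    prods = w * x
    keep = np.argsort(np.abs(prods))[:-k] if k > 0 else slice(None)
    return prods[keep].sum()

def truncated_classifier(w, x, k):
    # Binary classifier given by the sign of the truncated inner product.
    return np.sign(truncated_inner_product(w, x, k))
```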
Handling multiplicity without losing much power has been a persistent challenge in fields that routinely face large numbers of simultaneous statistical tests. Recently, $p$-value combination methods based on heavy-tailed distributions, such as the Cauchy distribution, have received much attention for their ability to handle multiplicity without knowledge of the dependence structure. This paper examines such $p$-value combinations through the lens of extreme value theory. Distributions with regularly varying tails, a subclass of heavy-tailed distributions, are found to be useful in constructing such $p$-value combinations. Three $p$-value combination statistics (sum, max cumulative sum, and max) are introduced, whose left-tail probabilities are shown to be approximately uniform under the global null. The primary objective of this paper is to bridge the gap between current developments in $p$-value combination methods and the literature on extreme value theory, while also offering guidance on selecting the calibrator and its associated parameters.
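For context, the widely used Cauchy combination test is one instance of a heavy-tailed (regularly varying) calibrator; the sketch below shows that special case only, not the paper's general construction, and the function name is an assumption.

```python
import numpy as np
from scipy import stats

def cauchy_combination(pvals, weights=None):
    # Heavy-tailed (Cauchy) p-value combination: transform each p-value to a
    # standard Cauchy quantile, take a weighted sum, and map the statistic
    # back to a global p-value.  The calibration is insensitive to the
    # (unknown) dependence among the individual tests.
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))
    return stats.cauchy.sf(t)
```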
In this paper, we introduce a Bayesian learning model to understand the behavior of Large Language Models (LLMs). We explore the optimization metric of LLMs, which is based on predicting the next token, and develop a novel model grounded in this principle. Our approach involves constructing an ideal generative text model represented by a multinomial transition probability matrix with a prior, and we examine how LLMs approximate this matrix. We discuss the continuity of the mapping between embeddings and multinomial distributions, and present the Dirichlet approximation theorem, which allows any prior to be approximated. Additionally, we demonstrate how text generation by LLMs aligns with Bayesian learning principles and explore the implications for in-context learning, specifically explaining why in-context learning emerges in larger models, where prompts are treated as samples with which the model's prior is updated. Our findings indicate that the behavior of LLMs is consistent with Bayesian learning, offering new insights into their functioning and potential applications.
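As a toy illustration of the Bayesian-learning viewpoint (not the paper's model, which concerns a multinomial transition matrix and general priors), the following sketch shows the basic conjugate Dirichlet-multinomial update, in which observed tokens, such as those in a prompt, update a prior over the next-token distribution; all names are hypothetical.

```python
import numpy as np

def dirichlet_posterior(alpha_prior, token_counts):
    # Conjugate Bayesian update: a Dirichlet prior over the next-token
    # multinomial distribution, updated with observed token counts
    # (e.g. tokens appearing in an in-context prompt).
    return np.asarray(alpha_prior, dtype=float) + np.asarray(token_counts, dtype=float)

def predictive_next_token(alpha_posterior):
    # Posterior predictive distribution over the vocabulary.
    a = np.asarray(alpha_posterior, dtype=float)
    return a / a.sum()
```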
For a sequence of random structures with $n$-element domains over a relational signature, we define its first order (FO) complexity as a certain subset of the Banach space $\ell^{\infty}/c_0$. The well-known FO zero-one law and FO convergence law correspond to FO complexities equal to $\{0,1\}$ and to a subset of $\mathbb{R}$, respectively. We present a hierarchy of FO complexity classes, introduce a stochastic FO reduction that allows complexity results to be transferred between different random structures, and use this tool to deduce several new logical limit laws for binomial random structures. Finally, we introduce a conditional distribution on graphs, subject to an FO sentence $\varphi$, that generalises certain well-known random graph models, exhibit instances of this distribution for every complexity class, and prove that the set of all $\varphi$ validating the 0--1 law is not recursively enumerable.
In this paper, we consider the hull of an algebraic geometry code, meaning the intersection of the code and its dual. We demonstrate how codes whose hulls are algebraic geometry codes may be defined using only rational places of Kummer extensions (and of Hermitian function fields in particular). Our primary tool is the explicit construction of non-special divisors of degrees $g$ and $g-1$ on certain families of function fields with many rational places, accomplished by appealing to Weierstrass semigroups. We provide explicit algebraic geometry codes with hulls of specified dimensions, producing along the way linear complementary dual (LCD) algebraic geometry codes from the Hermitian function field (among others) using only rational places, as well as an answer to an open question posed by Ballet and Le Brigand for particular function fields. These results complement earlier work of Mesnager, Tang, and Qi, which uses lower-genus function fields as well as higher-degree places of Hermitian function fields to construct LCD codes, and that of Carlet, Mesnager, Tang, Qi, and Pellikaan, by providing explicit algebraic geometry codes with the LCD property rather than obtaining codes via monomial equivalences.
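As a small, self-contained illustration of the hull (not of the Kummer or Hermitian constructions, which live over larger fields), the sketch below computes the hull dimension of a binary code from a generator matrix via the standard identity $\dim \mathrm{Hull}(C) = k - \mathrm{rank}(GG^T)$, so that an LCD code is one whose hull is trivial; the GF(2) setting and helper names are assumptions.

```python
import numpy as np

def rank_gf2(M):
    # Gaussian elimination over GF(2) to compute the rank of a binary matrix.
    A = np.array(M, dtype=np.int64) % 2
    rank = 0
    for col in range(A.shape[1]):
        pivot = next((r for r in range(rank, A.shape[0]) if A[r, col]), None)
        if pivot is None:
            continue
        A[[rank, pivot]] = A[[pivot, rank]]       # move pivot row into place
        for r in range(A.shape[0]):
            if r != rank and A[r, col]:
                A[r] ^= A[rank]                   # eliminate the column
        rank += 1
    return rank

def hull_dimension_gf2(G):
    # For a binary [n, k] code with full-row-rank generator matrix G, the
    # hull (the intersection of the code with its dual) has dimension
    # k - rank(G G^T); the code is LCD exactly when this is zero.
    G = np.array(G, dtype=np.int64) % 2
    return G.shape[0] - rank_gf2((G @ G.T) % 2)
```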
Transformers are widely used to extract semantic meaning from input tokens, yet they usually operate as black-box models. In this paper, we present a simple yet informative decomposition of the hidden states (or embeddings) of trained transformers into interpretable components. For any layer, the embedding vectors of the input sequence samples are represented by a tensor $\boldsymbol{h} \in \mathbb{R}^{C \times T \times d}$. Given the embedding vector $\boldsymbol{h}_{c,t} \in \mathbb{R}^d$ at sequence position $t \le T$ in a sequence (or context) $c \le C$, extracting the mean effects yields the decomposition \[ \boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t + \mathbf{ctx}_c + \mathbf{resid}_{c,t} \] where $\boldsymbol{\mu}$ is the global mean vector, $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are the mean vectors across contexts and across positions, respectively, and $\mathbf{resid}_{c,t}$ is the residual vector. For popular transformer architectures and diverse text datasets, we empirically find pervasive mathematical structure: (1) $(\mathbf{pos}_t)_{t}$ forms a low-dimensional, continuous, and often spiral shape across layers; (2) $(\mathbf{ctx}_c)_c$ exhibits clear cluster structure aligned with context topics; and (3) $(\mathbf{pos}_t)_{t}$ and $(\mathbf{ctx}_c)_c$ are mutually nearly orthogonal. We argue that smoothness is pervasive and beneficial to transformers trained on languages, and that our decomposition leads to improved model interpretability.
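A minimal sketch of how such a mean-effects decomposition can be computed from a stacked embedding tensor, assuming $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are taken as mean effects after removing the global mean (one natural reading of the formula above); the exact estimator used in the paper may differ.

```python
import numpy as np

def decompose_hidden_states(h):
    # h has shape (C, T, d): C contexts, T positions, embedding dimension d.
    mu = h.mean(axis=(0, 1))                 # global mean vector
    pos = h.mean(axis=0) - mu                # (T, d) positional mean effects
    ctx = h.mean(axis=1) - mu                # (C, d) contextual mean effects
    resid = h - mu - pos[None, :, :] - ctx[:, None, :]   # exact residual
    return mu, pos, ctx, resid
```

The near-orthogonality in (3) can then be probed by computing cosine similarities between the rows of pos and ctx.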
In this paper, we consider feature screening for ultrahigh dimensional clustering analyses. Based on the observation that the marginal distribution of any given feature is a mixture of its conditional distributions in different clusters, we propose to screen clustering features by independently evaluating the homogeneity of each feature's mixture distribution. Important cluster-relevant features have heterogeneous components in their mixture distributions, while unimportant features have homogeneous components. The well-known EM-test statistic is used to evaluate this homogeneity. Under general parametric settings, we establish tail probability bounds of the EM-test statistic for homogeneous and heterogeneous features, and further show that the proposed screening procedure achieves the sure independent screening property and even consistency in selection. The limiting distribution of the EM-test statistic is also obtained for general parametric distributions. The proposed method is computationally efficient, accurately screens for important cluster-relevant features, and helps to significantly improve clustering, as demonstrated in our extensive simulation and real data analyses.
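The following sketch conveys the screening idea with a simple likelihood-ratio proxy (a two-component versus one-component Gaussian mixture fit per feature) in place of the EM-test statistic, whose penalized, iterated form is more involved; the scikit-learn-based implementation and all names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def heterogeneity_score(x):
    # Likelihood-ratio proxy for the EM-test: compare a two-component
    # Gaussian mixture fit against a single Gaussian for one feature.
    x = x.reshape(-1, 1)
    ll2 = GaussianMixture(n_components=2, n_init=3).fit(x).score(x)
    ll1 = GaussianMixture(n_components=1).fit(x).score(x)
    return 2 * x.shape[0] * (ll2 - ll1)

def screen_features(X, top_k):
    # Rank features by the heterogeneity of their marginal distributions and
    # keep the top_k as candidate cluster-relevant features.
    scores = np.array([heterogeneity_score(X[:, j]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k]
```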
This study introduces a two-scale Graph Neural Operator (GNO), namely LatticeGraphNet (LGN), designed as a surrogate model for costly nonlinear finite-element simulations of three-dimensional latticed parts and structures. LGN comprises two networks: LGN-i, which learns the reduced dynamics of lattices, and LGN-ii, which learns the mapping from the reduced representation onto the tetrahedral mesh. LGN can predict deformation for arbitrary lattices, hence the name operator. Our approach significantly reduces inference time while maintaining high accuracy for unseen simulations, establishing GNOs as efficient surrogate models for evaluating the mechanical responses of lattices and structures.
We build on the theory of ontology logs (ologs) created by Spivak and Kent, and define a notion of wiring diagrams. In this article, a wiring diagram is a finite directed labelled graph. The labels correspond to types in an olog; they can also be interpreted as readings of sensors in an autonomous system. As such, wiring diagrams can be used as a framework in which an autonomous system forms abstract concepts. We show that the graphs underlying skeleton wiring diagrams form a category. This allows skeleton wiring diagrams to be compared and manipulated using techniques from both graph theory and category theory. We also extend the usual definition of graph edit distance to wiring diagrams by using operations available only to wiring diagrams, leading to a metric on the set of all skeleton wiring diagrams. Finally, we give an extended example of calculating the distance between two concepts represented by wiring diagrams, and explain how to apply our framework to any application domain.
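For intuition, the sketch below models a skeleton wiring diagram as a labelled directed graph in networkx and compares two small "concepts" with the generic labelled-graph edit distance; the paper's metric instead uses edit operations specific to wiring diagrams, so this is only a stand-in, and the node labels are invented.

```python
import networkx as nx

def wiring_diagram(nodes, edges):
    # A skeleton wiring diagram modelled as a finite directed labelled graph:
    # `nodes` maps node ids to type labels (olog types / sensor readings).
    g = nx.DiGraph()
    for node, label in nodes.items():
        g.add_node(node, label=label)
    g.add_edges_from(edges)
    return g

def diagram_distance(g1, g2):
    # Generic labelled-graph edit distance (insert/delete/substitute nodes
    # and edges), used here as a stand-in for the wiring-diagram metric.
    return nx.graph_edit_distance(
        g1, g2, node_match=lambda a, b: a["label"] == b["label"]
    )

# Example: two small concepts sharing one labelled edge.
a = wiring_diagram({1: "camera", 2: "obstacle"}, [(1, 2)])
b = wiring_diagram({1: "camera", 2: "obstacle", 3: "brake"}, [(1, 2), (2, 3)])
print(diagram_distance(a, b))
```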