Recent works attribute the capability of in-context learning (ICL) in large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., linear or 2-layer MLP) during inference. However, such constructions require large memory overhead, which makes simulation of more sophisticated internal models intractable. In this work, we propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models internally during inference (e.g., pre-trained language models). In particular, we introduce innovative approximation techniques that allow a TinT model with less than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass. TinT accommodates many common transformer variants and its design ideas also improve the efficiency of past instantiations of simple models inside transformers. We conduct end-to-end experiments to validate the internal fine-tuning procedure of TinT on various language modeling and downstream tasks. For example, even with a limited one-step budget, we observe TinT for a OPT-125M model improves performance by 4-16% absolute on average compared to OPT-125M. These findings suggest that large pre-trained language models are capable of performing intricate subroutines. To facilitate further work, a modular and extensible codebase for TinT is included.
Various collaborative distributed machine learning (CDML) systems, including federated learning systems and swarm learning systems, with different key traits were developed to leverage resources for development and use of machine learning (ML) models in a confidentiality-preserving way. To meet use case requirements, suitable CDML systems need to be selected. However, comparison between CDML systems regarding their suitability for use cases is often difficult. This work presents a CDML system conceptualization and CDML archetypes to support comparison of CDML systems and introduce scientific and practical audiences to the principal functioning and key traits of CDML systems.
This article outlines a broadly-applicable new method of statistical analysis for situations involving two competing hypotheses. Hypotheses assessment is a frequentist procedure designed to answer the question: Given the sample evidence (and assumed model), what is the relative plausibility of each hypothesis? Our aim is to determine frequentist confidences in the hypotheses that are relevant to the data at hand and are as powerful as the particular application allows. Hypotheses assessments complement significance tests because providing confidences in the hypotheses in addition to test results can better inform applied researchers about the strength of evidence provided by the data. For simple hypotheses, the method produces minimum and maximum confidences in each hypothesis. The composite case is more complex, and we introduce two conventions to aid with understanding the strength of evidence. Assessments are qualitatively different from significance test and confidence interval outcomes, and thus fill a gap in the statistician's toolkit.
We present a mechanized embedding of higher-order logic (HOL) and algebraic data types (ADT) into first-order logic with ZFC axioms. We implement this in the Lisa proof assistant for schematic first-order logic and its library based on axiomatic set theory. HOL proof steps are implemented as proof producing tactics in Lisa, and the types are interpreted as sets, with function (or arrow) types coinciding with set-theoretic function spaces. The embedded HOL proofs, as opposed to being a layer over the existing proofs, are interoperable with the existing library. This yields a form of soft type system supporting top-level polymorphism and ADTs over set theory, and offer tools to reason about functions in set theory.
Reinforcement learning (RL) tackles sequential decision-making problems by creating agents that interacts with their environment. However, existing algorithms often view these problem as static, focusing on point estimates for model parameters to maximize expected rewards, neglecting the stochastic dynamics of agent-environment interactions and the critical role of uncertainty quantification. Our research leverages the Kalman filtering paradigm to introduce a novel and scalable sampling algorithm called Langevinized Kalman Temporal-Difference (LKTD) for deep reinforcement learning. This algorithm, grounded in Stochastic Gradient Markov Chain Monte Carlo (SGMCMC), efficiently draws samples from the posterior distribution of deep neural network parameters. Under mild conditions, we prove that the posterior samples generated by the LKTD algorithm converge to a stationary distribution. This convergence not only enables us to quantify uncertainties associated with the value function and model parameters but also allows us to monitor these uncertainties during policy updates throughout the training phase. The LKTD algorithm paves the way for more robust and adaptable reinforcement learning approaches.
One of the central problems studied in the theory of machine learning is the question of whether, for a given class of hypotheses, it is possible to efficiently find a {consistent} hypothesis, i.e., which has zero training error. While problems involving {\em convex} hypotheses have been extensively studied, the question of whether efficient learning is possible for non-convex hypotheses composed of possibly several disconnected regions is still less understood. Although it has been shown quite a while ago that efficient learning of weakly convex hypotheses, a parameterized relaxation of convex hypotheses, is possible for the special case of Boolean functions, the question of whether this idea can be developed into a generic paradigm has not been studied yet. In this paper, we provide a positive answer and show that the consistent hypothesis finding problem can indeed be solved in polynomial time for a broad class of weakly convex hypotheses over metric spaces. To this end, we propose a general domain-independent algorithm for finding consistent weakly convex hypotheses and prove sufficient conditions for its efficiency that characterize the corresponding hypothesis classes. To illustrate our general algorithm and its properties, we discuss several non-trivial learning examples to demonstrate how it can be used to efficiently solve the corresponding consistent hypothesis finding problem. Without the weak convexity constraint, these problems are known to be computationally intractable. We then proceed to show that the general idea of our algorithm can even be extended to the case of extensional weakly convex hypotheses, as it naturally arise, e.g., when performing vertex classification in graphs. We prove that using our extended algorithm, the problem can be solved in polynomial time provided the distances in the domain can be computed efficiently.
Contrastive learning models have achieved great success in unsupervised visual representation learning, which maximize the similarities between feature representations of different views of the same image, while minimize the similarities between feature representations of views of different images. In text summarization, the output summary is a shorter form of the input document and they have similar meanings. In this paper, we propose a contrastive learning model for supervised abstractive text summarization, where we view a document, its gold summary and its model generated summaries as different views of the same mean representation and maximize the similarities between them during training. We improve over a strong sequence-to-sequence text generation model (i.e., BART) on three different summarization datasets. Human evaluation also shows that our model achieves better faithfulness ratings compared to its counterpart without contrastive objectives.
Geometric deep learning (GDL), which is based on neural network architectures that incorporate and process symmetry information, has emerged as a recent paradigm in artificial intelligence. GDL bears particular promise in molecular modeling applications, in which various molecular representations with different symmetry properties and levels of abstraction exist. This review provides a structured and harmonized overview of molecular GDL, highlighting its applications in drug discovery, chemical synthesis prediction, and quantum chemistry. Emphasis is placed on the relevance of the learned molecular features and their complementarity to well-established molecular descriptors. This review provides an overview of current challenges and opportunities, and presents a forecast of the future of GDL for molecular sciences.
This paper aims to mitigate straggler effects in synchronous distributed learning for multi-agent reinforcement learning (MARL) problems. Stragglers arise frequently in a distributed learning system, due to the existence of various system disturbances such as slow-downs or failures of compute nodes and communication bottlenecks. To resolve this issue, we propose a coded distributed learning framework, which speeds up the training of MARL algorithms in the presence of stragglers, while maintaining the same accuracy as the centralized approach. As an illustration, a coded distributed version of the multi-agent deep deterministic policy gradient(MADDPG) algorithm is developed and evaluated. Different coding schemes, including maximum distance separable (MDS)code, random sparse code, replication-based code, and regular low density parity check (LDPC) code are also investigated. Simulations in several multi-robot problems demonstrate the promising performance of the proposed framework.
Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph-structure is available. In practice, however, real-world graphs are often noisy and incomplete or might not be available at all. With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.
Deep learning has yielded state-of-the-art performance on many natural language processing tasks including named entity recognition (NER). However, this typically requires large amounts of labeled data. In this work, we demonstrate that the amount of labeled training data can be drastically reduced when deep learning is combined with active learning. While active learning is sample-efficient, it can be computationally expensive since it requires iterative retraining. To speed this up, we introduce a lightweight architecture for NER, viz., the CNN-CNN-LSTM model consisting of convolutional character and word encoders and a long short term memory (LSTM) tag decoder. The model achieves nearly state-of-the-art performance on standard datasets for the task while being computationally much more efficient than best performing models. We carry out incremental active learning, during the training process, and are able to nearly match state-of-the-art performance with just 25\% of the original training data.