The Lipschitz constant of the map between the input and output space represented by a neural network is a natural metric for assessing the robustness of the model. We present a new method to constrain the Lipschitz constant of dense deep learning models that can also be generalized to other architectures. The method relies on a simple weight normalization scheme during training that ensures the Lipschitz constant of every layer is below an upper limit specified by the analyst. A simple monotonic residual connection can then be used to make the model monotonic in any subset of its inputs, which is useful in scenarios where domain knowledge dictates such dependence. Examples can be found in algorithmic fairness requirements or, as presented here, in the classification of the decays of subatomic particles produced at the CERN Large Hadron Collider. Our normalization is minimally constraining and allows the underlying architecture to maintain higher expressiveness compared to other techniques which aim to either control the Lipschitz constant of the model or ensure its monotonicity. We show how the algorithm was used to train a powerful, robust, and interpretable discriminator for heavy-flavor-quark decays, which has been adopted for use as the primary data-selection algorithm in the LHCb real-time data-processing system in the current LHC data-taking period known as Run 3. Our algorithm also achieves state-of-the-art performance on benchmarks in medicine, finance, and other applications.
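A minimal numpy sketch of this kind of scheme is given below: each dense layer's weight matrix is rescaled so that its infinity norm stays below a user-chosen bound lam, and a residual term whose slope equals the network's overall Lipschitz bound restores monotonicity in a chosen subset of inputs. The choice of norm, the ReLU activations, and all function names are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def normalize_weight(W, lam):
    """Rescale W so its infinity norm (max absolute row sum) is at most lam,
    capping the Lipschitz constant of the corresponding dense layer."""
    norm = np.abs(W).sum(axis=1).max()
    return W if norm <= lam else W * (lam / norm)

def monotone_lipschitz_net(x, hidden, final_W, final_b, lam, monotone_mask):
    """Forward pass with every layer's Lipschitz constant capped at lam.
    Adding lam**depth * sum(selected inputs) makes the scalar output
    non-decreasing in the inputs flagged by monotone_mask."""
    h = x
    for W, b in hidden:
        h = np.maximum(normalize_weight(W, lam) @ h + b, 0.0)  # 1-Lipschitz ReLU
    y = (normalize_weight(final_W, lam) @ h + final_b)[0]
    lam_total = lam ** (len(hidden) + 1)   # bound on the whole network's Lipschitz constant
    return y + lam_total * float(monotone_mask @ x)

# Toy usage: a 2-hidden-layer net, constrained to be monotone in the first of three inputs.
rng = np.random.default_rng(0)
hidden = [(rng.normal(size=(8, 3)), np.zeros(8)), (rng.normal(size=(8, 8)), np.zeros(8))]
final_W, final_b = rng.normal(size=(1, 8)), np.zeros(1)
print(monotone_lipschitz_net(np.array([0.5, -1.0, 2.0]), hidden, final_W, final_b,
                             lam=1.0, monotone_mask=np.array([1.0, 0.0, 0.0])))
```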
Linear Recurrence Sequences (LRS) are a fundamental mathematical primitive for a plethora of applications such as model checking, probabilistic systems, computational biology, and economics. Positivity (are all terms of the given LRS at least 0?) and Ultimate Positivity (are all but finitely many terms of the given LRS at least 0?) are important open number-theoretic decision problems. Recently, the robust versions of these problems, which ask whether the LRS is (Ultimately) Positive despite small perturbations to its initialisation, have gained attention as a means to model the imprecision that arises in practical settings. In this paper, we consider Robust Positivity and Ultimate Positivity problems where the neighbourhood of the initialisation, specified in a natural and general format, is also part of the input. We contribute by proving sharp decidability results: decision procedures at orders beyond those our techniques can handle would entail significant number-theoretic breakthroughs.
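For readers unfamiliar with the objects involved, the sketch below (with made-up example values) merely unrolls an LRS and checks a finite prefix for positivity; it is an illustration of the problem statement, not a decision procedure, which is precisely what remains open in general.

```python
def lrs_terms(coeffs, init, n):
    """First n terms of the LRS u_{k+d} = coeffs[0]*u_{k+d-1} + ... + coeffs[d-1]*u_k,
    started from the initial values in init."""
    u = list(init)
    while len(u) < n:
        u.append(sum(c * v for c, v in zip(coeffs, reversed(u[-len(coeffs):]))))
    return u[:n]

def positive_prefix(coeffs, init, n=1000):
    """Check positivity of a finite prefix only; this does NOT decide Positivity."""
    return all(t >= 0 for t in lrs_terms(coeffs, init, n))

# Example LRS: u_{k+2} = u_{k+1} - u_k with u_0 = u_1 = 1 (its terms cycle through negatives).
print(positive_prefix([1, -1], [1, 1], 20))   # False
```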
Purpose: We propose a novel method for continual learning based on increasing the depth of neural networks. This work explores whether extending neural network depth may be beneficial in a life-long learning setting. Methods: We propose a novel approach based on adding new layers on top of existing ones to enable the forward transfer of knowledge and to adapt previously learned representations. We employ a method for determining the most similar tasks in order to select the best location in our network at which to add new nodes with trainable parameters. This approach allows for creating a tree-like model, where each node is a set of neural network parameters dedicated to a specific task. The proposed method is inspired by the Progressive Neural Network (PNN) concept and therefore benefits from dynamic changes in network structure. However, PNN allocates a lot of memory for the whole network structure during the learning process. The proposed method alleviates this by adding only part of a network for a new task and utilizing a subset of previously trained weights. At the same time, we retain the benefits of PNN, such as the absence of forgetting guaranteed by design, without needing a memory buffer. Results: Experiments on Split CIFAR and Split Tiny ImageNet show that the proposed algorithm is on par with other continual learning methods. In a more challenging setup, with a single computer vision dataset as a separate task, our method outperforms Experience Replay. Conclusion: The proposed method is compatible with commonly used computer vision architectures and does not require a custom network structure. Since adaptation to a changing data distribution is achieved by expanding the architecture, there is no need for a rehearsal buffer. For this reason, our method could be used for sensitive applications where data privacy must be considered.
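A rough sketch of the tree-like structure this describes might look as follows; the cosine-similarity task descriptors, layer sizes, and ReLU layers are hypothetical choices made for illustration rather than the paper's exact design.

```python
import numpy as np

class TaskNode:
    """One node of the tree-like model: a dense layer trained for a single task.
    Ancestor nodes are frozen, so earlier tasks are never forgotten by design."""
    def __init__(self, in_dim, out_dim, parent=None, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(out_dim, in_dim))
        self.b = np.zeros(out_dim)
        self.parent = parent
        self.descriptor = None   # e.g. a summary vector of the task's data (assumption)

    def forward(self, x):
        h = x if self.parent is None else self.parent.forward(x)   # frozen ancestors first
        return np.maximum(self.W @ h + self.b, 0.0)                # this task's ReLU layer

def attach_new_task(nodes, new_descriptor, out_dim, seed):
    """Grow the tree for a new task: pick the most similar existing node (here by
    cosine similarity of descriptors) and stack a fresh trainable layer on top of it."""
    sims = [np.dot(n.descriptor, new_descriptor) /
            (np.linalg.norm(n.descriptor) * np.linalg.norm(new_descriptor)) for n in nodes]
    parent = nodes[int(np.argmax(sims))]
    child = TaskNode(parent.W.shape[0], out_dim, parent=parent, seed=seed)
    child.descriptor = new_descriptor
    nodes.append(child)
    return child

# Toy usage: one root task, then a second task attached under the most similar node.
root = TaskNode(in_dim=32, out_dim=16)
root.descriptor = np.ones(32)
nodes = [root]
child = attach_new_task(nodes, new_descriptor=np.ones(32), out_dim=16, seed=1)
print(child.forward(np.zeros(32)).shape)   # (16,)
```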
In many industrial applications, obtaining labeled observations is not straightforward, as it often requires the intervention of human experts or the use of expensive testing equipment. In these circumstances, active learning can be highly beneficial in suggesting the most informative data points to be used when fitting a model. Reducing the number of observations needed for model development alleviates both the computational burden required for training and the operational expenses related to labeling. Online active learning, in particular, is useful in high-volume production processes where the decision of whether to acquire the label for a data point needs to be taken within an extremely short time frame. However, despite recent efforts to develop online active learning strategies, the behavior of these methods in the presence of outliers has not been thoroughly examined. In this work, we investigate the performance of online active linear regression in contaminated data streams. Our study shows that the currently available query strategies are prone to sampling outliers, whose inclusion in the training set eventually degrades the predictive performance of the models. To address this issue, we propose a solution that bounds the search area of a conditional D-optimal algorithm and uses a robust estimator. Our approach strikes a balance between exploring unseen regions of the input space and protecting against outliers. Through numerical simulations, we show that the proposed method is effective in improving the performance of online active learning in the presence of outliers, thus expanding the potential applications of this powerful tool.
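As a concrete, simplified illustration of the kind of rule involved (not the paper's exact algorithm), the snippet below accepts a streaming point only if it is informative in the D-optimal sense and lies inside a bounded region around the current data; the threshold, regularisation, and outlier model are arbitrary assumptions.

```python
import numpy as np

def d_optimal_score(x, info_matrix):
    """Prediction-variance style score x^T M^{-1} x used by D-optimal query rules:
    large values indicate informative, poorly explored directions."""
    return float(x @ np.linalg.solve(info_matrix, x))

def should_query(x, info_matrix, threshold, center, radius):
    """Query the label only if the point is informative AND lies inside a bounded
    search region around the current data, which keeps the strategy from chasing
    outliers (a simplified version of the idea, not the paper's algorithm)."""
    inside = np.linalg.norm(x - center) <= radius
    return inside and d_optimal_score(x, info_matrix) >= threshold

# Streaming usage sketch: update the information matrix after each queried point.
rng = np.random.default_rng(0)
info = np.eye(3) * 1e-3            # regularized X^T X
center, radius = np.zeros(3), 3.0
for _ in range(200):
    x = rng.normal(size=3)
    if rng.random() < 0.05:        # occasional gross outlier in the stream
        x *= 25.0
    if should_query(x, info, threshold=2.0, center=center, radius=radius):
        info += np.outer(x, x)     # label acquired: add x to the design
```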
The time-continuous Volterra equations valued in $\mathbb{R}$ with completely positive kernels have two basic monotonicity properties. The first is that, for suitably given signals, any two solution curves do not intersect. The second is that the solutions to the autonomous equations are monotone. Due to the fading memory principle, we also desire the kernels to be nonincreasing. In this work, through a generalization of the convolution to nonuniform meshes, we introduce the concept of ``right complementary monotone'' (R-CMM) kernels at the discrete level for nonuniform meshes, which inherits both the nonincreasing property and the complete positivity of the continuous level. We prove that the discrete solutions preserve these two monotonicity properties if the discretized kernel satisfies the R-CMM property. Technically, our proofs rely heavily on resolvent kernels.
The emergence of new applications brings multi-class traffic with diverse quality of service (QoS) demands to wide area networks (WANs), which motivates research in traffic engineering (TE). In recent years, novel centralized TE schemes have employed heuristic or machine-learning techniques to orchestrate resources in closed systems, such as datacenter networks. However, these schemes suffer from long delivery delays and high control overhead when applied to general WANs. Semi-centralized TE schemes have been proposed to address these drawbacks, providing lower delay and control overhead. Despite this, they suffer performance degradation when dealing with volatile traffic. To provide low-delay service and maintain high network utility, we propose an asynchronous multi-class traffic management scheme, AMTM. We first establish an asynchronous TE paradigm, in which distributed nodes instantly make traffic control decisions based on link prices. To manage varying traffic and control delivery time, we propose state-based iteration strategies for link pricing under different scenarios and investigate their convergence. Furthermore, we present a system design and corresponding algorithms. Simulation results indicate that AMTM outperforms existing schemes in terms of both delay reduction and scalability improvement. In addition, AMTM achieves 12-20$\%$ more network utility than the semi-centralized scheme and falls only 2-7$\%$ short of the theoretical optimum.
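The price-based paradigm the scheme builds on can be illustrated with the classic dual (link-price) iteration below; the log-utility best response, step size, and topology are assumptions for illustration, and AMTM's state-based pricing strategies are more elaborate than this.

```python
import numpy as np

def price_based_te(routes, capacities, weights, steps=500, alpha=0.01):
    """Classic dual (link-price) iteration for network utility maximization:
    each flow picks rate w / (sum of prices on its path) and each link raises
    its price when overloaded. This shows the price-signalling paradigm only,
    not AMTM's state-based update itself."""
    prices = np.ones(routes.shape[0])
    for _ in range(steps):
        path_price = routes.T @ prices                    # price seen by each flow
        rates = weights / np.maximum(path_price, 1e-9)    # log-utility best response
        load = routes @ rates
        prices = np.maximum(prices + alpha * (load - capacities), 1e-6)
    return rates, prices

# Hypothetical topology: two flows share link 0; flow 1 also traverses link 1.
routes = np.array([[1, 1],
                   [0, 1]], dtype=float)
capacities = np.array([1.0, 1.0])
rates, prices = price_based_te(routes, capacities, weights=np.array([1.0, 1.0]))
print(rates, prices)
```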
Thanks to their universal approximation properties and new efficient training strategies, Deep Neural Networks are becoming a valuable tool for the approximation of mathematical operators. In the present work, we introduce Mesh-Informed Neural Networks (MINNs), a class of architectures specifically tailored to handle mesh-based functional data, and thus of particular interest for reduced order modeling of parametrized Partial Differential Equations (PDEs). The driving idea behind MINNs is to embed hidden layers into discrete functional spaces of increasing complexity, obtained through a sequence of meshes defined over the underlying spatial domain. The approach leads to a natural pruning strategy that enables the design of sparse architectures able to learn general nonlinear operators. We assess this strategy through an extensive set of numerical experiments, ranging from nonlocal operators to nonlinear diffusion PDEs, where MINNs are compared against more traditional architectures, such as classical fully connected Deep Neural Networks, as well as more recent ones, such as DeepONets and Fourier Neural Operators. Our results show that MINNs can handle functional data defined on general domains of any shape, while ensuring reduced training times, lower computational costs, and better generalization capabilities, thus making MINNs well suited for demanding applications such as Reduced Order Modeling and Uncertainty Quantification for PDEs.
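A simplified rendition of the local-support pruning idea is sketched below, assuming a layer that maps samples on a fine mesh to samples on a coarse mesh and keeps only the weights connecting nearby mesh points; the radius, activation, and initialisation are illustrative.

```python
import numpy as np

def mesh_informed_mask(coarse_pts, fine_pts, radius):
    """Sparsity mask for a layer mapping a function sampled on fine_pts to one
    sampled on coarse_pts: output node i connects to input node j only if the
    two mesh points lie within `radius` of each other."""
    d = np.linalg.norm(coarse_pts[:, None, :] - fine_pts[None, :, :], axis=-1)
    return d <= radius

class SparseLayer:
    def __init__(self, mask, seed=0):
        rng = np.random.default_rng(seed)
        self.mask = mask.astype(float)
        self.W = rng.normal(scale=0.1, size=mask.shape) * self.mask
        self.b = np.zeros(mask.shape[0])

    def forward(self, v):
        # Pruned weights stay exactly zero, so the layer acts locally on the mesh.
        return np.tanh((self.W * self.mask) @ v + self.b)

# 1D example: 64-point fine mesh -> 16-point coarse mesh, local support of width 0.1.
fine = np.linspace(0, 1, 64)[:, None]
coarse = np.linspace(0, 1, 16)[:, None]
layer = SparseLayer(mesh_informed_mask(coarse, fine, radius=0.1))
print(layer.forward(np.sin(2 * np.pi * fine[:, 0])).shape)   # (16,)
```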
Graph Transformer is gaining increasing attention in the field of machine learning and has demonstrated state-of-the-art performance on benchmarks for graph representation learning. However, as current implementations of Graph Transformer primarily focus on learning representations of small-scale graphs, the quadratic complexity of the global self-attention mechanism presents a challenge for full-batch training when applied to larger graphs. Additionally, conventional sampling-based methods fail to capture necessary high-level contextual information, resulting in a significant loss of performance. In this paper, we introduce the Hierarchical Scalable Graph Transformer (HSGT) as a solution to these challenges. HSGT successfully scales the Transformer architecture to node representation learning tasks on large-scale graphs, while maintaining high performance. By utilizing graph hierarchies constructed through coarsening techniques, HSGT efficiently updates and stores multi-scale information in node embeddings at different levels. Together with sampling-based training methods, HSGT effectively captures and aggregates multi-level information on the hierarchical graph using only Transformer blocks. Empirical evaluations demonstrate that HSGT efficiently achieves state-of-the-art performance on large-scale benchmarks with graphs containing millions of nodes.
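At a very high level, the two ingredients can be sketched as plain graph coarsening plus standard self-attention over a small token set; the clustering, pooling, and token construction below are hypothetical simplifications, not HSGT's actual blocks.

```python
import numpy as np

def coarsen(adj, assignment):
    """Collapse nodes into super-nodes given a cluster assignment vector,
    producing the next (coarser) level of the hierarchy."""
    k = assignment.max() + 1
    S = np.zeros((adj.shape[0], k))
    S[np.arange(adj.shape[0]), assignment] = 1.0
    return (S.T @ adj @ S > 0).astype(float), S

def attention(Q, K, V):
    """Plain scaled dot-product self-attention (the quadratic-cost block that a
    hierarchical scheme restricts to small sampled/coarsened token sets)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Hypothetical two-level usage on a random 8-node graph with a made-up clustering.
rng = np.random.default_rng(0)
adj = (rng.random((8, 8)) < 0.3).astype(float)
adj = np.maximum(adj, adj.T)
assign = np.array([0, 0, 0, 1, 1, 2, 2, 3])
coarse_adj, S = coarsen(adj, assign)
X = rng.normal(size=(8, 4))                                # node features
X_super = (S.T @ X) / np.maximum(S.sum(0)[:, None], 1)     # mean-pooled super-node features
# Tokens for node 0: a few sampled node features (here simply the first three)
# plus the pooled feature of node 0's super-node from the coarser level.
tokens = np.vstack([X[:3], X_super[assign[0]][None]])
out = attention(tokens, tokens, tokens)[0]                 # updated embedding for node 0
print(out.shape)                                           # (4,)
```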
In many board games and other abstract games, patterns have been used as features that can guide automated game-playing agents. Such patterns or features often represent particular configurations of pieces, empty positions, etc., which may be relevant for a game's strategies. Their use has been particularly prevalent in the game of Go, but also in many other games used as benchmarks for AI research. In this paper, we formulate a design and efficient implementation of spatial state-action features for general games. These are patterns that can be trained to incentivise or disincentivise actions based on whether or not they match variables of the state in a local area around action variables. We provide extensive details on several design and implementation choices, with a primary focus on achieving a high degree of generality to support a wide variety of different games using different board geometries or other graphs. We also propose an efficient approach for evaluating active features for any given set of features. In this approach, we take inspiration from heuristics used in problems such as SAT to optimise the order in which parts of patterns are matched and to prune unnecessary evaluations. This approach is defined for a highly general and abstract description of the problem -- phrased as optimising the order in which propositions of formulas in disjunctive normal form are evaluated -- and may therefore also be of interest for problems other than board games. An empirical evaluation on 33 distinct games in the Ludii general game system demonstrates the efficiency of this approach in comparison to a naive baseline, as well as a baseline based on prefix trees, and demonstrates that the additional efficiency significantly improves the playing strength of agents that use the features to guide search.
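As a toy illustration of the DNF-evaluation view (with hypothetical proposition names and statistics), one can short-circuit each conjunctive clause and reorder its propositions so the ones most likely to fail are tested first:

```python
def order_props(clause, false_rate):
    """Reorder the propositions of one conjunctive clause so that those most
    likely to be False (per observed statistics, as in SAT-style heuristics)
    are tested first, maximising early short-circuiting."""
    return sorted(clause, key=lambda p: -false_rate.get(p, 0.0))

def eval_dnf(dnf, state, false_rate):
    """Evaluate a formula in disjunctive normal form: a list of clauses, each a
    list of (proposition, expected_value) pairs tested against the game state."""
    for clause in dnf:
        if all(state.get(p) == v for p, v in order_props(clause, false_rate)):
            return True     # one satisfied clause is enough
    return False

# Hypothetical feature: "friendly stone to the north AND empty cell to the east".
dnf = [[("north_friend", True), ("east_empty", True)]]
state = {"north_friend": True, "east_empty": False}
print(eval_dnf(dnf, state, false_rate={("east_empty", True): 0.7}))   # False
```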
When is heterogeneity in the composition of an autonomous robotic team beneficial and when is it detrimental? We investigate and answer this question in the context of a minimally viable model that examines the role of heterogeneous speeds in perimeter defense problems, where defenders share a total allocated speed budget. We consider two distinct problem settings and develop strategies based on dynamic programming and on local interaction rules. We present a theoretical analysis of both approaches and our results are extensively validated using simulations. Interestingly, our results demonstrate that the viability of heterogeneous teams depends on the amount of information available to the defenders. Moreover, our results suggest a universality property: across a wide range of problem parameters the optimal ratio of the speeds of the defenders remains nearly constant.
Graph Neural Networks (GNNs) for representation learning of graphs broadly follow a neighborhood aggregation framework, where the representation vector of a node is computed by recursively aggregating and transforming feature vectors of its neighboring nodes. Many GNN variants have been proposed and have achieved state-of-the-art results on both node and graph classification tasks. However, despite GNNs revolutionizing graph representation learning, there is limited understanding of their representational properties and limitations. Here, we present a theoretical framework for analyzing the expressive power of GNNs in capturing different graph structures. Our results characterize the discriminative power of popular GNN variants, such as Graph Convolutional Networks and GraphSAGE, and show that they cannot learn to distinguish certain simple graph structures. We then develop a simple architecture that is provably the most expressive among the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theoretical findings on a number of graph classification benchmarks, and demonstrate that our model achieves state-of-the-art performance.
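The architecture described here, known in the literature as the Graph Isomorphism Network (GIN), updates node features by sum aggregation over neighbours followed by an MLP; the sketch below, with illustrative layer sizes and a simple sum readout, shows one such layer.

```python
import numpy as np

def gin_layer(adj, H, W1, W2, eps=0.0):
    """One GIN-style layer: sum-aggregate neighbour features, add (1 + eps) times
    the node's own feature, then apply a 2-layer MLP. Sum aggregation over the
    multiset of neighbour features is what keeps the update as discriminative as
    the Weisfeiler-Lehman test."""
    agg = (1.0 + eps) * H + adj @ H          # (1 + eps) * h_v + sum over neighbours
    return np.maximum(agg @ W1, 0.0) @ W2    # MLP with one hidden ReLU layer

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)     # a 3-node path graph centred on node 0
H = rng.normal(size=(3, 4))                  # initial node features
H = gin_layer(adj, H, rng.normal(size=(4, 8)), rng.normal(size=(8, 8)))
print(H.sum(axis=0))                         # sum readout gives a graph-level representation
```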