Recurrent neural networks (RNNs) notoriously struggle to learn long-term memories, primarily due to vanishing and exploding gradients. The recent success of state-space models (SSMs), a subclass of RNNs, in overcoming such difficulties challenges our theoretical understanding. In this paper, we delve into the optimization challenges of RNNs and discover that, as the memory of a network increases, changes in its parameters result in increasingly large output variations, making gradient-based learning highly sensitive, even without exploding gradients. Our analysis further reveals the importance of the element-wise recurrence design pattern combined with careful parametrizations in mitigating this effect. This feature is present in SSMs, as well as in other architectures, such as LSTMs. Overall, our insights provide a new explanation for some of the difficulties in gradient-based learning of RNNs and why some architectures perform better than others.
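The sensitivity effect described above can be illustrated with a toy example. The sketch below is not the paper's analysis. It assumes a scalar linear recurrence h_t = lam * h_{t-1} + x_t with constant input, where the names `final_state` and `sensitivity` are hypothetical helpers. For |lam| < 1 the state stays bounded, yet the derivative of the steady state 1/(1 - lam) with respect to lam is 1/(1 - lam)^2, which grows rapidly as lam approaches 1 (longer memory).

```python
def final_state(lam, inputs):
    """Run the scalar recurrence h_t = lam * h_{t-1} + x_t and return the last state."""
    h = 0.0
    for x in inputs:
        h = lam * h + x
    return h


def sensitivity(lam, inputs, eps=1e-6):
    """Central finite-difference estimate of d h_T / d lam."""
    upper = final_state(lam + eps, inputs)
    lower = final_state(lam - eps, inputs)
    return (upper - lower) / (2 * eps)


# Constant input, long enough for the state to reach its steady value 1 / (1 - lam).
inputs = [1.0] * 5000

# Larger lam means longer memory; the output reacts more strongly to a change in lam.
sens = {lam: sensitivity(lam, inputs) for lam in (0.5, 0.9, 0.99)}
for lam, value in sens.items():
    print(f"lam={lam}: d h_T / d lam ~ {value:.1f}")
```

In this toy setting the sensitivity follows 1/(1 - lam)^2, so a parametrization that controls the distance of lam from 1 directly is one way to keep updates well scaled. This is an assumption-laden illustration rather than the authors' construction.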
Neural networks are groups of neurons stacked in multiple layers to mimic the biological neurons of the human brain. For several decades, neural networks have been trained with the backpropagation algorithm, which is based on gradient descent, and several variants have been developed to improve it. Backpropagation optimizes the loss function of the network, but the loss landscape of the constructed network contains several local minima, each corresponding to a different solution. Gradient descent cannot avoid these local minima and, depending on the initialization, gets stuck in one of them. Particle swarm optimization (PSO) was proposed to select the best local minimum in the search space of the loss function. However, the search is limited to the particles instantiated by the PSO algorithm, so it sometimes fails to select the best solution. In the proposed approach, we overcome the problem of gradient descent and the limitation of the PSO algorithm by training individual neurons separately, so that they collectively solve the problem as a group of neurons forming a network. Our code and data are available at //github.com/dipkmr/train-nn-wobp/
A statistical network model with overlapping communities can be generated as a superposition of mutually independent random graphs of varying size. The model is parameterized by the number of nodes, the number of communities, and the joint distribution of the community size and the edge probability. This model admits sparse parameter regimes with power-law limiting degree distributions and non-vanishing clustering coefficients. This article presents large-scale approximations of clique and cycle frequencies for graph samples generated by the model, which are valid for regimes with unbounded numbers of overlapping communities. Our results reveal the growth rates of these subgraph frequencies and show that their theoretical densities can be reliably estimated from data.
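The generative mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's exact construction: each community is given as a (size, edge probability) pair supplied by the caller instead of being drawn from a joint distribution, members are chosen uniformly at random, and the function name `sample_overlapping_communities` is hypothetical.

```python
import random
from itertools import combinations


def sample_overlapping_communities(n, layers, seed=0):
    """Superpose independent random graphs on n nodes.

    layers: list of (size, edge_prob) pairs, one per community (assumed given).
    Returns the union of all community edges as a set of pairs (u, v) with u < v.
    """
    rng = random.Random(seed)
    edges = set()
    for size, edge_prob in layers:
        # Assumption: community members are a uniform random subset of the nodes.
        members = sorted(rng.sample(range(n), size))
        # Inside a community, each pair is linked independently with edge_prob.
        for u, v in combinations(members, 2):
            if rng.random() < edge_prob:
                edges.add((u, v))
    return edges


# Example: 50 nodes, three communities of different size and density.
graph = sample_overlapping_communities(50, [(10, 0.8), (20, 0.3), (5, 1.0)])
print(len(graph), "edges")
```

Because the edge set is a union, a pair of nodes that shares several communities is still linked by at most one edge, which is how overlap enters the sketch.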
Powerful deep neural networks are vulnerable to adversarial attacks. To obtain adversarially robust models, researchers have separately developed adversarial training and Jacobian regularization techniques. There are abundant theoretical and empirical studies of adversarial training, but theoretical foundations for Jacobian regularization are still lacking. In this study, we show that Jacobian regularization is closely related to adversarial training in that the $\ell_{2}$ or $\ell_{1}$ Jacobian regularized loss serves as an approximate upper bound on the adversarially robust loss under $\ell_{2}$ or $\ell_{\infty}$ adversarial attack, respectively. Further, we establish the robust generalization gap for the Jacobian regularized risk minimizer by bounding the Rademacher complexity of both the standard loss function class and the Jacobian regularization function class. Our theoretical results indicate that the norms of the Jacobian are related to both standard and robust generalization. We also perform experiments on MNIST data classification to demonstrate that Jacobian regularized risk minimization indeed serves as a surrogate for adversarially robust risk minimization, and that reducing the norms of the Jacobian can improve both standard and robust generalization. This study advances both the theoretical and the empirical understanding of adversarially robust generalization via Jacobian regularization.
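The pairing of norms in the bound above can be illustrated with a first-order sketch. The example below is a toy under stated assumptions, not the study's proof: it takes a scalar loss with a hypothetical gradient `g`, so a small perturbation `delta` changes the loss by approximately the inner product of `g` and `delta`. By the Cauchy-Schwarz inequality this change is at most eps times the l2 norm of `g` under an l2 attack, and by Hölder's inequality it is at most eps times the l1 norm of `g` under an l-infinity attack.

```python
import math


def l1(v):
    return sum(abs(a) for a in v)


def l2(v):
    return math.sqrt(sum(a * a for a in v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


g = [0.5, -1.2, 2.0, -0.3]  # hypothetical loss gradient at some input
eps = 0.1                   # attack budget

# Worst-case l2 perturbation: aligned with the gradient, rescaled to length eps.
delta_l2 = [eps * a / l2(g) for a in g]
# Worst-case l-infinity perturbation: eps times the sign of each gradient entry.
delta_inf = [eps * math.copysign(1.0, a) for a in g]

change_l2 = abs(dot(g, delta_l2))    # attains eps * l2(g)
change_inf = abs(dot(g, delta_inf))  # attains eps * l1(g)

print(change_l2, eps * l2(g))
print(change_inf, eps * l1(g))
```

In this linearized picture the bounds are tight, which gives some intuition for why penalizing the corresponding Jacobian norm acts as a surrogate for the robust loss. The approximation error of the linearization is ignored here.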
Most of the scientific literature on causal modeling considers the structural framework of Pearl and the potential-outcome framework of Rubin to be formally equivalent, and therefore uses do-interventions and the potential-outcome subscript notation interchangeably to write counterfactual outcomes. In this paper, we agnostically superimpose the two causal models to specify under which mathematical conditions structural counterfactual outcomes and potential outcomes need to, do not need to, can, or cannot be equal (almost surely or in law). Our comparison serves as a reminder that a structural causal model and a Rubin causal model compatible with the same observations do not have to coincide, and highlights real-world problems where they cannot even correspond. Then, we examine common claims and practices from the causal-inference literature in the light of these results. In doing so, we aim to clarify the relationship between the two causal frameworks, and the interpretation of their respective counterfactuals.
This study presents a novel representation learning model tailored for dynamic networks, which describe the continuously evolving relationships among individuals within a population. The problem is framed as a dimension reduction problem in functional data analysis. With dynamic networks represented as matrix-valued functions, our objective is to map this functional data into a set of vector-valued functions in a lower-dimensional learning space. This space, defined as a metric functional space, allows for the calculation of norms and inner products. By constructing this learning space, we address (i) attribute learning, (ii) community detection, and (iii) link prediction and recovery of individual nodes in the dynamic network. Our model also accommodates asymmetric low-dimensional representations, enabling the separate study of nodes' regulatory and receiving roles. Crucially, the learning method accounts for the time-dependency of networks, ensuring that representations are continuous over time. The functional learning space we define naturally spans the time frame of the dynamic networks, facilitating both the inference of network links at specific time points and the reconstruction of the entire network structure without direct observation. We validated our approach through simulation studies and real-world applications. In simulations, we compared our method's link prediction performance to existing approaches under various data corruption scenarios. For real-world applications, we examined a dynamic social network replicated across six ant populations, demonstrating that our low-dimensional learning space effectively captures interactions, roles of individual ants, and the social evolution of the network. Our findings align with existing knowledge of ant colony behavior.
For several types of information relations, the induced rough sets system RS does not form a lattice but only a partially ordered set. However, by studying its Dedekind-MacNeille completion DM(RS), one may reveal new important properties of rough set structures. Building upon D. Umadevi's work on describing joins and meets in DM(RS), we previously investigated pseudo-Kleene algebras defined on DM(RS) for reflexive relations. This paper delves deeper into the order-theoretic properties of DM(RS) in the context of reflexive relations. We describe the completely join-irreducible elements of DM(RS) and characterize when DM(RS) is a spatial completely distributive lattice. We show that even in the case of a non-transitive reflexive relation, DM(RS) can form a Nelson algebra, a property generally associated with quasiorders. We introduce a novel concept, the core of a relational neighborhood, and use it to provide a necessary and sufficient condition for DM(RS) to determine a Nelson algebra.
We investigate a Tikhonov regularization scheme specifically tailored for shallow neural networks within the context of solving a classic inverse problem: approximating an unknown function and its derivatives within a unit cubic domain based on noisy measurements. The proposed Tikhonov regularization scheme incorporates a penalty term that takes three distinct yet intricately related network (semi)norms: the extended Barron norm, the variation norm, and the Radon-BV seminorm. These choices of the penalty term are contingent upon the specific architecture of the neural network being utilized. We establish the connection between various network norms and particularly trace the dependence of the dimensionality index, aiming to deepen our understanding of how these norms interplay with each other. We revisit the universality of function approximation through various norms, establish rigorous error-bound analysis for the Tikhonov regularization scheme, and explicitly elucidate the dependency of the dimensionality index, providing a clearer understanding of how the dimensionality affects the approximation performance and how one designs a neural network with diverse approximating tasks.
While neural networks enjoy outstanding flexibility and exhibit unprecedented performance, the mechanism behind their behavior is still not well understood. To tackle this fundamental challenge, researchers have tried to restrict and manipulate some of their properties in order to gain new insights and better control over them. In particular, over the past few years, the concept of \emph{bi-Lipschitzness} has proven to be a beneficial inductive bias in many areas. However, due to its complexity, the design and control of bi-Lipschitz architectures are falling behind, and a model that is precisely designed for bi-Lipschitzness, realizing direct and simple control of the constants along with solid theoretical analysis, is lacking. In this work, we investigate and propose a novel framework for bi-Lipschitzness that can achieve such clear and tight control based on convex neural networks and the Legendre-Fenchel duality. Its desirable properties are illustrated with concrete experiments. We also apply this framework to uncertainty estimation and monotone problem settings to illustrate its broad range of applications.
Eye movements provide a window into human behaviour, attention, and interaction dynamics. Challenges in real-world, multi-person environments have, however, restricted eye-tracking research predominantly to single-person, in-lab settings. We developed a system to stream, record, and analyse synchronised data from multiple mobile eye-tracking devices during collective viewing experiences (e.g., concerts, films, lectures). We implemented lightweight operator interfaces for real-time monitoring, remote troubleshooting, and gaze projection from individual egocentric perspectives to a common coordinate space for shared gaze analysis. We tested the system in a live concert and a film screening with 30 simultaneous viewers during each of two public events (N=60). We observe precise time synchronisation between devices, measured through recorded clock offsets, and accurate gaze projection in challenging dynamic scenes. Our novel analysis metrics and visualisations illustrate the potential of collective eye-tracking data for understanding collaborative behaviour and social interaction. This advancement promotes ecological validity in eye-tracking research and paves the way for innovative interactive tools.
Recent advances in 3D fully convolutional networks (FCN) have made it feasible to produce dense voxel-wise predictions of volumetric images. In this work, we show that a multi-class 3D FCN trained on manually labeled CT scans of several anatomical structures (ranging from the large organs to thin vessels) can achieve competitive segmentation results, while avoiding the need for handcrafting features or training class-specific models. To this end, we propose a two-stage, coarse-to-fine approach that first uses a 3D FCN to roughly define a candidate region, which is then used as input to a second 3D FCN. This reduces the number of voxels the second FCN has to classify to ~10% and allows it to focus on more detailed segmentation of the organs and vessels. We utilize training and validation sets consisting of 331 clinical CT images and test our models on a completely unseen data collection acquired at a different hospital that includes 150 CT scans, targeting three anatomical organs (liver, spleen, and pancreas). In challenging organs such as the pancreas, our cascaded approach improves the mean Dice score from 68.5% to 82.2%, achieving the highest reported average score on this dataset. We compare with a 2D FCN method on a separate dataset of 240 CT scans with 18 classes and achieve significantly higher performance in small organs and vessels. Furthermore, we explore fine-tuning our models to different datasets. Our experiments illustrate the promise and robustness of current 3D FCN based semantic segmentation of medical images, achieving state-of-the-art results. Our code and trained models are available for download: //github.com/holgerroth/3Dunet_abdomen_cascade.
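For readers unfamiliar with the Dice score reported above, the sketch below shows the standard definition, twice the overlap divided by the total size of the two masks, on flattened binary masks. It is an illustration only and not the evaluation code from the linked repository. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def dice_score(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * overlap / total


# Toy example: six voxels, two of which are labeled as organ in both masks.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, truth)
print(f"Dice = {100 * score:.1f}%")
```

A volumetric mask would be flattened before the call, and a multi-class evaluation would typically compute one such score per organ label.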