Causal abstraction (CA) theory establishes formal criteria for relating multiple structural causal models (SCMs) at different levels of granularity by defining maps between them. These maps have significant relevance for real-world challenges such as synthesizing causal evidence from multiple experimental environments, learning causally consistent representations at different resolutions, and linking interventions across multiple SCMs. In this work, we propose COTA, the first method to learn abstraction maps from observational and interventional data without assuming complete knowledge of the underlying SCMs. In particular, we introduce a multi-marginal Optimal Transport (OT) formulation that enforces do-calculus causal constraints, together with a cost function that relies on interventional information. We extensively evaluate COTA on synthetic and real world problems, and showcase its advantages over non-causal, independent and aggregated COTA formulations. Finally, we demonstrate the efficiency of our method as a data augmentation tool by comparing it against the state-of-the-art CA learning framework, which assumes fully specified SCMs, on a real-world downstream task.
Estimating the statistics of the state of a dynamical system, from partial and noisy observations, is both mathematically challenging and finds wide application. Furthermore, the applications are of great societal importance, including problems such as probabilistic weather forecasting and prediction of epidemics. Particle filters provide a well-founded approach to the problem, leading to provably accurate approximations of the statistics. However these methods perform poorly in high dimensions. In 1994 the idea of ensemble Kalman filtering was introduced by Evensen, leading to a methodology that has been widely adopted in the geophysical sciences and also finds application to quite general inverse problems. However, ensemble Kalman filters have defied rigorous analysis of their statistical accuracy, except in the linear Gaussian setting. In this article we describe recent work which takes first steps to analyze the statistical accuracy of ensemble Kalman filters beyond the linear Gaussian setting. The subject is inherently technical, as it involves the evolution of probability measures according to a nonlinear and nonautonomous dynamical system; and the approximation of this evolution. It can nonetheless be presented in a fairly accessible fashion, understandable with basic knowledge of dynamical systems, numerical analysis and probability.
Presenting high-level arguments is a crucial task for fostering participation in online societal discussions. Current argument summarization approaches miss an important facet of this task -- capturing diversity -- which is important for accommodating multiple perspectives. We introduce three aspects of diversity: those of opinions, annotators, and sources. We evaluate approaches to a popular argument summarization task called Key Point Analysis, which shows how these approaches struggle to (1) represent arguments shared by few people, (2) deal with data from various sources, and (3) align with subjectivity in human-provided annotations. We find that both general-purpose LLMs and dedicated KPA models exhibit this behavior, but have complementary strengths. Further, we observe that diversification of training data may ameliorate generalization. Addressing diversity in argument summarization requires a mix of strategies to deal with subjectivity.
The Sinkhorn algorithm is the state-of-the-art to approximate solutions of entropic optimal transport (OT) distances between discrete probability distributions. We show that meticulously training a neural network to learn initializations to the algorithm via the entropic OT dual problem can significantly speed up convergence, while maintaining desirable properties of the Sinkhorn algorithm, such as differentiability and parallelizability. We train our predictive network in an adversarial fashion using a second, generating network and a self-supervised bootstrapping loss. The predictive network is universal in the sense that it is able to generalize to any pair of distributions of fixed dimension and cost at inference, and we prove that we can make the generating network universal in the sense that it is capable of producing any pair of distributions during training. Furthermore, we show that our network can even be used as a standalone OT solver to approximate regularized transport distances to a few percent error, which makes it the first meta neural OT solver.
The automorphism groups of various linear codes are well-studied yielding valuable insights into the respective code structure. This knowledge is successfully applied in, e.g., theoretical analysis and in improving decoding performance motivating the analyses of endomorphisms of linear codes. In this work, we discuss the structure of the set of transformation matrices of code endomorphisms, defined as a generalization of code automorphisms, and provide an explicit construction of a bijective mapping between the image of an endomorphism and its canonical quotient space. Furthermore, we introduce a one-to-one mapping between the set of transformation matrices of endomorphisms and a larger linear block code enabling the use of well-known algorithms for the search for suitable endomorphisms. Additionally, we propose an approach to obtain unknown code endomorphisms based on automorphisms of the code. Furthermore, we consider ensemble decoding as a possible use case for endomorphisms by introducing endomorphism ensemble decoding. Interestingly, EED can improve decoding performance when other ensemble decoding schemes are not applicable.
A process algebra is proposed, whose semantics maps a term to a nondeterministic finite automaton (NFA, for short). We prove a representability theorem: for each NFA $N$, there exists a process algebraic term $p$ such that its semantics is an NFA isomorphic to $N$. Moreover, we provide a concise axiomatization of language equivalence: two NFAs $N_1$ and $N_2$ recognize the same language if and only if the associated terms $p_1$ and $p_2$, respectively, can be equated by means of a set of axioms, comprising 7 axioms plus 3 conditional axioms, only.
Bayesian model updating facilitates the calibration of analytical models based on observations and the quantification of uncertainties in model parameters such as stiffness and mass. This process significantly enhances damage assessment and response predictions in existing civil structures. Predominantly, current methods employ modal properties identified from acceleration measurements to evaluate the likelihood of the model parameters. This modal analysis-based likelihood generally involves a prior assumption regarding the mass parameters. In civil structures, accurately determining mass parameters proves challenging owing to the time-varying nature of imposed loads. The resulting inaccuracy potentially introduces biases while estimating the stiffness parameters, which affects the assessment of structural response and associated damage. Addressing this issue, the present study introduces a stress-resultant-based approach for Bayesian model updating independent of mass assumptions. This approach utilizes system identification on strain and acceleration measurements to establish the relationship between nodal displacements and elemental stress resultants. Employing static analysis to depict this relationship aids in assessing the likelihood of stiffness parameters. Integrating this static-analysis-based likelihood with a modal-analysis-based likelihood facilitates the simultaneous estimation of mass and stiffness parameters. The proposed approach was validated using numerical examples on a planar frame and experimental studies on a full-scale moment-resisting steel frame structure.
As artificial intelligence (AI) models continue to scale up, they are becoming more capable and integrated into various forms of decision-making systems. For models involved in moral decision-making, also known as artificial moral agents (AMA), interpretability provides a way to trust and understand the agent's internal reasoning mechanisms for effective use and error correction. In this paper, we provide an overview of this rapidly-evolving sub-field of AI interpretability, introduce the concept of the Minimum Level of Interpretability (MLI) and recommend an MLI for various types of agents, to aid their safe deployment in real-world settings.
Residual networks (ResNets) have displayed impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of network weights to a smooth function as the number of layers increases. We investigate the properties of weights trained by stochastic gradient descent and their scaling with network depth through detailed numerical experiments. We observe the existence of scaling regimes markedly different from those assumed in neural ODE literature. Depending on certain features of the network architecture, such as the smoothness of the activation function, one may obtain an alternative ODE limit, a stochastic differential equation or neither of these. These findings cast doubts on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.
We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, the surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and which fine aspects of an architecture affect the behavior of a learning task in which way. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail.
Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.