The language of information theory is favored in both causal reasoning and machine learning frameworks. But, is there a better language than this? In this study, we demonstrate the pitfalls of infotheoretic estimation using first order statistics on (short) sequences for causal learning. We recommend the use of data compression based approaches for causality testing since these make very little assumptions on data as opposed to infotheoretic measures, and are more robust to finite data length effects. We conclude with a discussion on the challenges posed in modeling the effects of conditioning process $X$ with another process $Y$ in causal machine learning. Specifically, conditioning can increase 'confusion' which can be difficult to model by classical information theory. A conscious causal agent creates new choices, decisions and meaning which poses huge challenges for AI.
This paper studies the Random Utility Model (RUM) in environments where the decision maker is imperfectly informed about the payoffs associated to each of the alternatives he faces. By embedding the RUM into an online decision problem, we make four contributions. First, we propose a gradient-based learning algorithm and show that a large class of RUMs are Hannan consistent (\citet{Hahn1957}); that is, the average difference between the expected payoffs generated by a RUM and that of the best fixed policy in hindsight goes to zero as the number of periods increase. Second, we show that the class of Generalized Extreme Value (GEV) models can be implemented with our learning algorithm. Examples in the GEV class include the Nested Logit, Ordered, and Product Differentiation models among many others. Third, we show that our gradient-based algorithm is the dual, in a convex analysis sense, of the Follow the Regularized Leader (FTRL) algorithm, which is widely used in the Machine Learning literature. Finally, we discuss how our approach can incorporate recency bias and be used to implement prediction markets in general environments.
Dynamical systems are widely used in science and engineering to model systems consisting of several interacting components. Often, they can be given a causal interpretation in the sense that they not only model the evolution of the states of the system's components over time, but also describe how their evolution is affected by external interventions on the system that perturb the dynamics. We introduce the formal framework of structural dynamical causal models (SDCMs) that explicates the causal semantics of the system's components as part of the model. SDCMs represent a dynamical system as a collection of stochastic processes and specify the basic causal mechanisms that govern the dynamics of each component as a structured system of random differential equations of arbitrary order. SDCMs extend the versatile causal modeling framework of structural causal models (SCMs), also known as structural equation models (SEMs), by explicitly allowing for time-dependence. An SDCM can be thought of as the stochastic-process version of an SCM, where the static random variables of the SCM are replaced by dynamic stochastic processes and their derivatives. We provide the foundations for a theory of SDCMs, by (i) formally defining SDCMs, their solutions, stochastic interventions, and a graphical representation; (ii) studying existence and uniqueness of the solutions for given initial conditions; (iii) discussing under which conditions SDCMs equilibrate to SCMs as time tends to infinity; (iv) relating the properties of the SDCM to those of the equilibrium SCM. This correspondence enables one to leverage the wealth of statistical tools and discovery methods available for SCMs when studying the causal semantics of a large class of stochastic dynamical systems. The theory is illustrated with several well-known examples from different scientific domains.
Comparing probability distributions is at the crux of many machine learning algorithms. Maximum Mean Discrepancies (MMD) and Optimal Transport distances (OT) are two classes of distances between probability measures that have attracted abundant attention in past years. This paper establishes some conditions under which the Wasserstein distance can be controlled by MMD norms. Our work is motivated by the compressive statistical learning (CSL) theory, a general framework for resource-efficient large scale learning in which the training data is summarized in a single vector (called sketch) that captures the information relevant to the considered learning task. Inspired by existing results in CSL, we introduce the H\"older Lower Restricted Isometric Property (H\"older LRIP) and show that this property comes with interesting guarantees for compressive statistical learning. Based on the relations between the MMD and the Wasserstein distance, we provide guarantees for compressive statistical learning by introducing and studying the concept of Wasserstein learnability of the learning task, that is when some task-specific metric between probability distributions can be bounded by a Wasserstein distance.
We study the problem of inverse reinforcement learning (IRL), where the learning agent recovers a reward function using expert demonstrations. Most of the existing IRL techniques make the often unrealistic assumption that the agent has access to full information about the environment. We remove this assumption by developing an algorithm for IRL in partially observable Markov decision processes (POMDPs). The algorithm addresses several limitations of existing techniques that do not take the information asymmetry between the expert and the learner into account. First, it adopts causal entropy as the measure of the likelihood of the expert demonstrations as opposed to entropy in most existing IRL techniques, and avoids a common source of algorithmic complexity. Second, it incorporates task specifications expressed in temporal logic into IRL. Such specifications may be interpreted as side information available to the learner a priori in addition to the demonstrations and may reduce the information asymmetry. Nevertheless, the resulting formulation is still nonconvex due to the intrinsic nonconvexity of the so-called forward problem, i.e., computing an optimal policy given a reward function, in POMDPs. We address this nonconvexity through sequential convex programming and introduce several extensions to solve the forward problem in a scalable manner. This scalability allows computing policies that incorporate memory at the expense of added computational cost yet also outperform memoryless policies. We demonstrate that, even with severely limited data, the algorithm learns reward functions and policies that satisfy the task and induce a similar behavior to the expert by leveraging the side information and incorporating memory into the policy.
We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.
This paper focuses on the expected difference in borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook the confounding effects and hence the estimation error can be magnificent. As such, we propose another approach to construct the estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of estimating the causal quantities between the classical estimators and the proposed estimators. The comparison is tested across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under different simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction of estimation error is strikingly substantial if the causal effects are accounted for correctly.
A comprehensive artificial intelligence system needs to not only perceive the environment with different `senses' (e.g., seeing and hearing) but also infer the world's conditional (or even causal) relations and corresponding uncertainty. The past decade has seen major advances in many perception tasks such as visual object recognition and speech recognition using deep learning models. For higher-level inference, however, probabilistic graphical models with their Bayesian nature are still more powerful and flexible. In recent years, Bayesian deep learning has emerged as a unified probabilistic framework to tightly integrate deep learning and Bayesian models. In this general framework, the perception of text or images using deep learning can boost the performance of higher-level inference and in turn, the feedback from the inference process is able to enhance the perception of text or images. This survey provides a comprehensive introduction to Bayesian deep learning and reviews its recent applications on recommender systems, topic models, control, etc. Besides, we also discuss the relationship and differences between Bayesian deep learning and other related topics such as Bayesian treatment of neural networks.
Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
Deep learning has penetrated all aspects of our lives and brought us great convenience. However, the process of building a high-quality deep learning system for a specific task is not only time-consuming but also requires lots of resources and relies on human expertise, which hinders the development of deep learning in both industry and academia. To alleviate this problem, a growing number of research projects focus on automated machine learning (AutoML). In this paper, we provide a comprehensive and up-to-date study on the state-of-the-art AutoML. First, we introduce the AutoML techniques in details according to the machine learning pipeline. Then we summarize existing Neural Architecture Search (NAS) research, which is one of the most popular topics in AutoML. We also compare the models generated by NAS algorithms with those human-designed models. Finally, we present several open problems for future research.
The era of big data provides researchers with convenient access to copious data. However, people often have little knowledge about it. The increasing prevalence of big data is challenging the traditional methods of learning causality because they are developed for the cases with limited amount of data and solid prior causal knowledge. This survey aims to close the gap between big data and learning causality with a comprehensive and structured review of traditional and frontier methods and a discussion about some open problems of learning causality. We begin with preliminaries of learning causality. Then we categorize and revisit methods of learning causality for the typical problems and data types. After that, we discuss the connections between learning causality and machine learning. At the end, some open problems are presented to show the great potential of learning causality with data.