Optimal Transport (OT) theory has seen an increasing amount of attention from the computer science community due to its potency and relevance in modeling and machine learning. It introduces means that serve as powerful ways to compare probability distributions with each other, as well as producing optimal mappings to minimize cost functions. In this survey, we present a brief introduction and history, a survey of previous work and propose directions of future study. We will begin by looking at the history of optimal transport and introducing the founders of this field. We then give a brief glance into the algorithms related to OT. Then, we will follow up with a mathematical formulation and the prerequisites to understand OT. These include Kantorovich duality, entropic regularization, KL Divergence, and Wassertein barycenters. Since OT is a computationally expensive problem, we then introduce the entropy-regularized version of computing optimal mappings, which allowed OT problems to become applicable in a wide range of machine learning problems. In fact, the methods generated from OT theory are competitive with the current state-of-the-art methods. We follow this up by breaking down research papers that focus on image processing, graph learning, neural architecture search, document representation, and domain adaptation. We close the paper with a small section on future research. Of the recommendations presented, three main problems are fundamental to allow OT to become widely applicable but rely strongly on its mathematical formulation and thus are hardest to answer. Since OT is a novel method, there is plenty of space for new research, and with more and more competitive methods (either on an accuracy level or computational speed level) being created, the future of applied optimal transport is bright as it has become pervasive in machine learning.
We consider the problem of partitioning a graph into a non-fixed number of non-overlapping subgraphs of maximum density. The density of a partition is the sum of the densities of the subgraphs, where the density of the subgraph is the ratio of its number of edges and its number of vertices. This problem, called Dense Graph Partition, is known to be NP-hard on general graphs and polynomial-time solvable on trees, and polynomial-time 2-approximable. In this paper we study the restriction of Dense Graph Partition to particular sparse and dense graph classes. In particular, we prove that it is NP-hard on dense bipartite graphs as well as on cubic graphs. On dense bipartite graphs, we further show that it is W[2]-hard parameterized by the number of sets in the optimum solution. On dense graphs on $n$ vertices, it is polynomial-time solvable on graphs with minimum degree $n-3$ and NP-hard on $(n-4)$-regular graphs. We prove that it is polynomial-time $4/3$-approximable on cubic graphs and admits an efficient polynomial-time approximation scheme on $(n-4)$-regular graphs.
Gradient boosting is a prediction method that iteratively combines weak learners to produce a complex and accurate model. From an optimization point of view, the learning procedure of gradient boosting mimics a gradient descent on a functional variable. This paper proposes to build upon the proximal point algorithm, when the empirical risk to minimize is not differentiable, in order to introduce a novel boosting approach, called proximal boosting. Besides being motivated by non-differentiable optimization, the proposed algorithm benefits from algorithmic improvements such as controlling the approximation error and Nesterov's acceleration, in the same way as gradient boosting [Grubb and Bagnell, 2011, Biau et al., 2018]. This leads to two variants, respectively called residual proximal boosting and accelerated proximal boosting. Theoretical convergence is proved for the first two procedures under different hypotheses on the empirical risk and advantages of leveraging proximal methods for boosting are illustrated by numerical experiments on simulated and real-world data. In particular, we exhibit a favorable comparison over gradient boosting regarding convergence rate and prediction accuracy.
We study optimal transport problems in which finite-valued quantities of interest evolve dynamically over time in a stationary fashion. Mathematically, this is a special case of the general optimal transport problem in which the distributions under study represent stationary processes and the cost depends on a finite number of time points. In this setting, we argue that one should restrict attention to stationary couplings, also known as joinings, which have close connections with long run average cost. We introduce estimators of both optimal joinings and the optimal joining cost, and we establish their consistency under mild conditions. Under stronger mixing assumptions we establish finite-sample error rates for the same estimators that extend the best known results in the iid case. Finally, we extend the consistency and rate analysis to an entropy-penalized version of the optimal joining problem.
Graphs are widely used as a popular representation of the network structure of connected data. Graph data can be found in a broad spectrum of application domains such as social systems, ecosystems, biological networks, knowledge graphs, and information systems. With the continuous penetration of artificial intelligence technologies, graph learning (i.e., machine learning on graphs) is gaining attention from both researchers and practitioners. Graph learning proves effective for many tasks, such as classification, link prediction, and matching. Generally, graph learning methods extract relevant features of graphs by taking advantage of machine learning algorithms. In this survey, we present a comprehensive overview on the state-of-the-art of graph learning. Special attention is paid to four categories of existing graph learning methods, including graph signal processing, matrix factorization, random walk, and deep learning. Major models and algorithms under these categories are reviewed respectively. We examine graph learning applications in areas such as text, images, science, knowledge graphs, and combinatorial optimization. In addition, we discuss several promising research directions in this field.
Curriculum learning (CL) is a training strategy that trains a machine learning model from easier data to harder data, which imitates the meaningful learning order in human curricula. As an easy-to-use plug-in, the CL strategy has demonstrated its power in improving the generalization capacity and convergence rate of various models in a wide range of scenarios such as computer vision and natural language processing etc. In this survey article, we comprehensively review CL from various aspects including motivations, definitions, theories, and applications. We discuss works on curriculum learning within a general CL framework, elaborating on how to design a manually predefined curriculum or an automatic curriculum. In particular, we summarize existing CL designs based on the general framework of Difficulty Measurer+Training Scheduler and further categorize the methodologies for automatic CL into four groups, i.e., Self-paced Learning, Transfer Teacher, RL Teacher, and Other Automatic CL. We also analyze principles to select different CL designs that may benefit practical applications. Finally, we present our insights on the relationships connecting CL and other machine learning concepts including transfer learning, meta-learning, continual learning and active learning, etc., then point out challenges in CL as well as potential future research directions deserving further investigations.
We present a continuous formulation of machine learning, as a problem in the calculus of variations and differential-integral equations, very much in the spirit of classical numerical analysis and statistical physics. We demonstrate that conventional machine learning models and algorithms, such as the random feature model, the shallow neural network model and the residual neural network model, can all be recovered as particular discretizations of different continuous formulations. We also present examples of new models, such as the flow-based random feature model, and new algorithms, such as the smoothed particle method and spectral method, that arise naturally from this continuous formulation. We discuss how the issues of generalization error and implicit regularization can be studied under this framework.
Graphical causal inference as pioneered by Judea Pearl arose from research on artificial intelligence (AI), and for a long time had little connection to the field of machine learning. This article discusses where links have been and should be established, introducing key concepts along the way. It argues that the hard open problems of machine learning and AI are intrinsically related to causality, and explains how the field is beginning to understand them.
This paper surveys the machine learning literature and presents machine learning as optimization models. Such models can benefit from the advancement of numerical optimization techniques which have already played a distinctive role in several machine learning settings. Particularly, mathematical optimization models are presented for commonly used machine learning approaches for regression, classification, clustering, and deep neural networks as well new emerging applications in machine teaching and empirical model learning. The strengths and the shortcomings of these models are discussed and potential research directions are highlighted.
This manuscript surveys reinforcement learning from the perspective of optimization and control with a focus on continuous control applications. It surveys the general formulation, terminology, and typical experimental implementations of reinforcement learning and reviews competing solution paradigms. In order to compare the relative merits of various techniques, this survey presents a case study of the Linear Quadratic Regulator (LQR) with unknown dynamics, perhaps the simplest and best studied problem in optimal control. The manuscript describes how merging techniques from learning theory and control can provide non-asymptotic characterizations of LQR performance and shows that these characterizations tend to match experimental behavior. In turn, when revisiting more complex applications, many of the observed phenomena in LQR persist. In particular, theory and experiment demonstrate the role and importance of models and the cost of generality in reinforcement learning algorithms. This survey concludes with a discussion of some of the challenges in designing learning systems that safely and reliably interact with complex and uncertain environments and how tools from reinforcement learning and controls might be combined to approach these challenges.
Multi-Task Learning (MTL) is a learning paradigm in machine learning and its aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks. In this paper, we give a survey for MTL. First, we classify different MTL algorithms into several categories: feature learning approach, low-rank approach, task clustering approach, task relation learning approach, dirty approach, multi-level approach and deep learning approach. In order to compare different approaches, we discuss the characteristics of each approach. In order to improve the performance of learning tasks further, MTL can be combined with other learning paradigms including semi-supervised learning, active learning, reinforcement learning, multi-view learning and graphical models. When the number of tasks is large or the data dimensionality is high, batch MTL models are difficult to handle this situation and online, parallel and distributed MTL models as well as feature hashing are reviewed to reveal the computational and storage advantages. Many real-world applications use MTL to boost their performance and we introduce some representative works. Finally, we present theoretical analyses and discuss several future directions for MTL.