In this work, we provide a fundamental unified convergence theorem used for deriving both expected and almost sure convergence results for a series of stochastic optimization methods. Our unified theorem only requires the verification of several representative conditions and is not tailored to any specific algorithm. As a direct application, we recover expected and almost sure convergence results of the stochastic gradient method (SGD) and random reshuffling (RR) under more general settings. Moreover, we establish new expected and almost sure convergence results for the stochastic proximal gradient method (prox-SGD) and stochastic model-based methods (SMM) for nonsmooth nonconvex optimization problems. These applications reveal that our unified theorem provides a plugin-type convergence analysis and strong convergence guarantees for a wide class of stochastic optimization methods.
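To make the two sampling schemes concrete, the following is a minimal sketch (an assumed finite-sum least-squares problem, not the paper's setting) contrasting SGD, which draws indices independently with replacement, with random reshuffling, which sweeps through a fresh permutation each epoch:

import numpy as np

# Assumed toy problem: minimize (1/n) * sum_i (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
grad_i = lambda x, i: 2.0 * A[i] * (A[i] @ x - b[i])  # gradient of the i-th term

def sgd(x, epochs=50, lr=1e-2):
    for _ in range(epochs * len(b)):
        x = x - lr * grad_i(x, rng.integers(len(b)))   # i.i.d. index with replacement
    return x

def random_reshuffling(x, epochs=50, lr=1e-2):
    for _ in range(epochs):
        for i in rng.permutation(len(b)):              # one without-replacement pass
            x = x - lr * grad_i(x, i)
    return x

x_sgd = sgd(np.zeros(5))
x_rr = random_reshuffling(np.zeros(5))

Both loops are instances of the methods named above; the step sizes and the toy data are illustrative assumptions only.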
In this paper, we study the almost sure boundedness and the convergence of the stochastic approximation (SA) algorithm. At present, most available convergence proofs are based on the ODE method, and the almost sure boundedness of the iterations is an assumption and not a conclusion. In Borkar-Meyn (2000), it is shown that if the ODE has only one globally attractive equilibrium, then under additional assumptions, the iterations are bounded almost surely, and the SA algorithm converges to the desired solution. Our objective in the present paper is to provide an alternative proof of the above, based on martingale methods, which are simpler and less technical than those based on the ODE method. As a prelude, we prove a new sufficient condition for the global asymptotic stability of an ODE. Next, we prove a ``converse'' Lyapunov theorem on the existence of a suitable Lyapunov function with a globally bounded Hessian for a globally exponentially stable system. Both theorems are of independent interest to researchers in stability theory. Then, using these results, we provide sufficient conditions for the almost sure boundedness and the convergence of the SA algorithm. We show through examples that our theory covers some situations that are not covered by currently known results, specifically Borkar-Meyn (2000).
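For readers unfamiliar with the algorithm being analyzed, a minimal sketch of the stochastic approximation iteration x_{k+1} = x_k + a_k (f(x_k) + noise) on an assumed one-dimensional example (not taken from the paper) is:

import numpy as np

# Assumed mean field f with a unique root at x* = 2; the Robbins-Monro step
# sizes a_k = 1/(k+1) satisfy sum a_k = infinity and sum a_k^2 < infinity.
rng = np.random.default_rng(0)
f = lambda x: -(x - 2.0)
x = 0.0
for k in range(10_000):
    a_k = 1.0 / (k + 1)
    x += a_k * (f(x) + rng.normal(scale=0.5))   # noisy evaluation of f
print(x)   # close to the root x* = 2

The contribution of the paper concerns when such iterations remain bounded and converge almost surely, not the iteration itself.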
With a few exceptions, work in offline reinforcement learning (RL) has so far assumed that there is no confounding. In a classical regression setting, confounders introduce omitted variable bias and inhibit the identification of causal effects. In offline RL, they prevent the identification of a policy's value, and therefore make it impossible to perform policy improvement. Using conventional methods in offline RL in the presence of confounding can therefore not only lead to poor decisions and poor policies, but can also have disastrous effects in applications such as healthcare and education. We provide approaches for both off-policy evaluation (OPE) and local policy optimization in the settings of i.i.d. and global confounders. Theoretical and empirical results confirm the validity and viability of these methods.
For basic machine learning problems, the expected error is used to evaluate model performance. Since the distribution of the data is usually unknown, one commonly makes the simple hypothesis that the data are sampled independently and identically distributed (i.i.d.), and the mean value of the loss function is then used as the empirical risk, justified by the Law of Large Numbers (LLN). This is known as the Monte Carlo method. However, when the LLN is not applicable, such as in imbalanced data problems, minimizing the empirical risk can cause overfitting and may decrease robustness and generalization ability. Inspired by the framework of nonlinear expectation theory, we replace the mean value of the loss function with the maximum of the subgroup mean losses, and we call the result the nonlinear Monte Carlo method. To apply numerical optimization methods, we linearize and smooth the maximum empirical risk functional and obtain the descent direction via quadratic programming. With the proposed method, we achieve better performance than SOTA backbone models with fewer training steps, as well as greater robustness, on basic regression and imbalanced classification tasks.
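The core substitution, the maximum of subgroup mean losses in place of the overall mean loss, can be sketched as follows; the grouping variable, group count, and use of PyTorch are illustrative assumptions, and the smoothed quadratic-programming descent step is not reproduced here:

import torch

def nonlinear_mc_risk(per_sample_loss: torch.Tensor,
                      group_ids: torch.Tensor,
                      num_groups: int) -> torch.Tensor:
    # Replace the ordinary empirical risk (mean of all losses) with the
    # maximum of the per-group mean losses ("nonlinear Monte Carlo" risk).
    group_means = []
    for g in range(num_groups):
        mask = group_ids == g
        if mask.any():
            group_means.append(per_sample_loss[mask].mean())
    return torch.stack(group_means).max()

Because the maximum is taken over differentiable group means, this quantity can be dropped into a standard training loop in place of the usual mean loss.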
Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tabular setting, one cannot enumerate all the states and then iteratively update the policy for each state. This prevents the application of many well-studied RL methods, especially those with provable convergence guarantees. In this paper, we first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces. We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all. Moreover, we present a novel policy dual averaging method for which possibly simpler function approximation techniques can be applied. We establish linear convergence rates to global optimality or sublinear convergence to stationarity for these methods applied to different classes of RL problems under exact policy evaluation. We then define proper notions of the approximation errors for policy evaluation and investigate their impact on the convergence of these methods applied to general-state RL problems with either finite-action or continuous-action spaces. To the best of our knowledge, both the development of these algorithmic frameworks and their convergence analysis are new in the literature.
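As background for the policy mirror descent generalization, a minimal sketch of a single-state, finite-action mirror descent step with the KL divergence as the Bregman distance (an assumed textbook form, not the paper's general-state update) is:

import numpy as np

def pmd_update(pi_s, q_s, eta):
    # One KL-regularized step: pi_{k+1}(.|s) is proportional to
    # pi_k(.|s) * exp(-eta * Q(s, .)) when actions incur costs.
    logits = np.log(pi_s) - eta * q_s
    w = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return w / w.sum()

# Mass shifts toward the action with the smallest cost estimate.
print(pmd_update(np.array([0.25, 0.25, 0.25, 0.25]),
                 np.array([1.0, 0.2, 0.8, 0.5]), eta=1.0))

The point of the generalization above is precisely to move beyond such per-state updates, via function approximation, when states and actions cannot be enumerated.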
This note complements the upcoming paper "One-Way Ticket to Las Vegas and the Quantum Adversary" by Belovs and Yolcu, to be presented at QIP 2023. I develop the ideas behind the duality between the adversary bound and universal algorithms therein in a different form. This form may be quicker to grasp for a general quantum information audience: it avoids defining the "unidirectional filtered $\gamma_{2}$-bound" and relating it to query algorithms explicitly. The proof is also more general, because the lower bound (and the universal query algorithm) apply to a class of optimal control problems rather than just query problems. That is in addition to the advantages to be discussed in Belovs-Yolcu, namely a more elementary algorithm and correctness proof that avoids phase estimation and spectral analysis, allows for a limited treatment of noise, and removes another $\Theta(\log(1/\epsilon))$ factor from the runtime compared to the previous discrete-time algorithm.
Decentralized optimization is gaining increased traction due to its widespread applications in large-scale machine learning and multi-agent systems. The same mechanism that enables its success, i.e., information sharing among participating agents, however, also leads to the disclosure of individual agents' private information, which is unacceptable when sensitive data are involved. As differential privacy is becoming a de facto standard for privacy preservation, results have recently emerged that integrate differential privacy with distributed optimization. However, directly incorporating differential-privacy design into existing distributed optimization approaches significantly compromises optimization accuracy. In this paper, we redesign and tailor gradient methods for differentially private distributed optimization, proposing two differential-privacy-oriented gradient methods that ensure both rigorous $\epsilon$-differential privacy and optimality. The first algorithm is based on static-consensus-based gradient methods, and the second is based on dynamic-consensus (gradient-tracking) based distributed optimization methods and is hence applicable to general directed interaction graph topologies. Both algorithms can simultaneously ensure almost sure convergence to an optimal solution and a finite privacy budget, even when the number of iterations goes to infinity. To our knowledge, this is the first time that both goals have been achieved simultaneously. Numerical simulations using a distributed estimation problem and experimental results on a benchmark dataset confirm the effectiveness of the proposed approaches.
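The general shape of such methods can be illustrated with a minimal sketch of one privatized static-consensus gradient iteration; the Laplace noise injection, mixing matrix, and step size are generic assumptions and do not reproduce the two algorithms proposed here:

import numpy as np

def dp_consensus_gradient_step(x, grads, W, step, noise_scale, rng):
    # x: (n_agents, d) local iterates; grads: (n_agents, d) local gradients;
    # W: (n_agents, n_agents) doubly stochastic mixing matrix.
    shared = x + rng.laplace(scale=noise_scale, size=x.shape)  # privatized messages
    mixed = W @ shared                                         # consensus averaging
    return mixed - step * grads                                # local gradient step

The design question addressed above is how to schedule the step sizes and noise scales so that convergence to an optimum and a finite cumulative privacy budget hold at the same time.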
Over the last decade, a series of applied mathematics papers have explored a type of inverse problem--called by a variety of names, including "inverse sensitivity", "pushforward based inference", "consistent Bayesian inference", or "data-consistent inversion"--wherein a solution is a probability density whose pushforward takes a given form. The formulation of such a stochastic inverse problem can be unexpected or confusing to those familiar with traditional Bayesian or other statistical inference. To date, two classes of solutions have been proposed, and these have only been justified through applications of measure theory and its disintegration theorem. In this work we show that, under mild assumptions, the formulation of and solution to all stochastic inverse problems can be more clearly understood using basic probability theory: a stochastic inverse problem is simply a change-of-variables or an approximation thereof. For the two existing classes of solutions, we derive the relationship to change(s)-of-variables and illustrate it using analytic examples where none previously existed. Our derivations use neither Bayes' theorem nor the disintegration theorem explicitly. Our final contribution is a careful comparison of changes-of-variables to more traditional statistical inference. While taking stochastic inverse problems at face value for the majority of the paper, our final comparative discussion offers a critique of the framework.
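To give a concrete flavor of the change-of-variables view, here is a minimal sampling-based sketch of a pushforward-reweighting update; the forward map, the initial density, the observed density, and the kernel density estimate are all illustrative assumptions rather than an example from the paper:

import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
Q = lambda lam: lam ** 2                 # assumed forward map
lam = rng.normal(0.0, 1.0, 5000)         # samples from an assumed initial density
push = gaussian_kde(Q(lam))              # estimated pushforward of that density
obs = norm(loc=0.5, scale=0.1)           # assumed observed density on Q(lambda)

# Reweight initial samples by the ratio of observed to pushforward densities,
# a sampling analogue of the change-of-variables style update.
w = obs.pdf(Q(lam)) / push(Q(lam))
w /= w.sum()
updated = rng.choice(lam, size=5000, p=w)

By construction, pushing the reweighted samples through Q approximately reproduces the observed density, which is the defining property of the stochastic inverse problems discussed above.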
In recent years, Graph Neural Networks have reported outstanding performance in tasks like community detection, molecule classification, and link prediction. However, the black-box nature of these models prevents their application in domains like health and finance, where understanding the models' decisions is essential. Counterfactual Explanations (CE) provide this understanding through examples. Moreover, the literature on CE is flourishing with novel explanation methods tailored to graph learning. In this survey, we analyse the existing Graph Counterfactual Explanation methods, providing the reader with an organisation of the literature according to a uniform formal notation for definitions, datasets, and metrics, thus simplifying comparisons with respect to each method's advantages and disadvantages. We discuss seven methods and sixteen synthetic and real datasets, providing details on the possible generation strategies. We highlight the most common evaluation strategies and formalise nine of the metrics used in the literature. We also introduce the evaluation framework GRETEL and show how it can be extended and used, providing a further dimension of comparison that encompasses reproducibility aspects. Finally, we discuss how counterfactual explanation interplays with privacy and fairness, before delving into open challenges and future work.
Deep models trained in supervised mode have achieved remarkable success on a variety of tasks. When labeled samples are limited, self-supervised learning (SSL) is emerging as a new paradigm for making use of large amounts of unlabeled samples. SSL has achieved promising performance on natural language and image learning tasks. Recently, there has been a trend toward extending such success to graph data using graph neural networks (GNNs). In this survey, we provide a unified review of the different ways of training GNNs using SSL. Specifically, we categorize SSL methods into contrastive and predictive models. For each category, we provide a unified framework and describe how the individual methods differ in each component of that framework. Our unified treatment of SSL methods for GNNs sheds light on the similarities and differences of various methods, setting the stage for developing new methods and algorithms. We also summarize the different SSL settings and the corresponding datasets used in each setting. To facilitate methodological development and empirical comparison, we develop a standardized testbed for SSL in GNNs, including implementations of common baseline methods, datasets, and evaluation metrics.
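As an illustration of the contrastive category, a minimal sketch of an InfoNCE-style objective between node embeddings from two augmented views of the same graph follows; it is a generic example of the family, not one of the specific methods reviewed:

import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    # Rows of z1 and z2 are embeddings of the same nodes under two augmentations;
    # matching rows are positives, all other rows serve as negatives.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                       # scaled cosine similarities
    targets = torch.arange(z1.size(0))               # positives on the diagonal
    return F.cross_entropy(logits, targets)

Predictive methods, by contrast, train the encoder against self-generated labels (e.g., reconstructing masked attributes or predicting graph properties) rather than against paired views.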
Since deep neural networks were developed, they have made huge contributions to everyday life. In almost every aspect of daily life, machine learning can now provide more rational advice than humans are capable of on their own. However, despite this achievement, the design and training of neural networks remain challenging and unpredictable procedures. To lower the technical barrier for ordinary users, automated hyper-parameter optimization (HPO) has become a popular topic in both academia and industry. This paper provides a review of the most essential topics in HPO. The first section introduces the key hyper-parameters related to model training and structure, and discusses their importance and methods for defining their value ranges. The paper then focuses on major optimization algorithms and their applicability, covering their efficiency and accuracy, especially for deep learning networks. It next reviews major services and toolkits for HPO, comparing their support for state-of-the-art search algorithms, compatibility with major deep learning frameworks, and extensibility for new modules designed by users. The paper concludes with the problems that arise when HPO is applied to deep learning, a comparison between optimization algorithms, and prominent approaches for model evaluation under limited computational resources.