While solutions of Distributionally Robust Optimization (DRO) problems can sometimes have a higher out-of-sample expected reward than the Sample Average Approximation (SAA), there is no guarantee. In this paper, we introduce the class of Distributionally Optimistic Optimization (DOO) models, and show that it is always possible to "beat" SAA out-of-sample if we consider not just worst-case (DRO) models but also best-case (DOO) ones. We also show, however, that this comes at a cost: Optimistic solutions are more sensitive to model error than either worst-case or SAA optimizers, and hence are less robust.
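To make the contrast concrete, the three problem classes can be written schematically as follows, where $\hat{P}_n$ is the empirical distribution, $\mathcal{U}(\hat{P}_n)$ a generic ambiguity set around it, and $r(x,\xi)$ the reward; this is an illustrative sketch, not the paper's specific construction.
\begin{align*}
\text{SAA:} \quad & \max_{x \in X} \; \mathbb{E}_{\hat{P}_n}[r(x,\xi)], \\
\text{DRO (worst case):} \quad & \max_{x \in X} \; \min_{Q \in \mathcal{U}(\hat{P}_n)} \mathbb{E}_{Q}[r(x,\xi)], \\
\text{DOO (best case):} \quad & \max_{x \in X} \; \max_{Q \in \mathcal{U}(\hat{P}_n)} \mathbb{E}_{Q}[r(x,\xi)].
\end{align*}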
Labelling data is a major practical bottleneck in training and testing classifiers. Given a collection of unlabelled data points, we address how to select which subset to label to best estimate test metrics such as accuracy, $F_1$ score or micro/macro $F_1$. We consider two sampling-based approaches: the well-known Importance Sampling, and a novel application of Poisson Sampling that we introduce. For both approaches we derive the minimal-error sampling distributions, show how to approximate them, and use them to form estimators and confidence intervals. We show that Poisson Sampling outperforms Importance Sampling both theoretically and experimentally.
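As an illustration of the Poisson Sampling approach, the sketch below forms a Horvitz-Thompson-style estimate of accuracy from independently sampled labels. The inclusion probabilities \texttt{pi} are assumed given (they stand in for, but are not, the minimal-error distribution derived in the paper), and \texttt{label\_fn} is a hypothetical labelling oracle.
\begin{verbatim}
import numpy as np

def poisson_sample_accuracy(y_pred, label_fn, pi, rng=None):
    """Estimate accuracy via Poisson sampling with inclusion
    probabilities pi (illustrative sketch only)."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    n = len(y_pred)
    sampled = rng.random(n) < pi            # each point labelled independently
    idx = np.where(sampled)[0]
    correct = np.array([y_pred[i] == label_fn(i) for i in idx], dtype=float)
    weights = 1.0 / pi[idx]                 # Horvitz-Thompson reweighting
    return float(np.sum(weights * correct) / n)
\end{verbatim}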
Metric Multidimensional Scaling (MDS) is a classical method for generating meaningful (non-linear) low-dimensional embeddings of high-dimensional data. MDS has a long history in the statistics, machine learning, and graph drawing communities. In particular, the Kamada-Kawai force-directed graph drawing method is equivalent to MDS and is one of the most popular ways in practice to embed graphs into low dimensions. Despite its ubiquity, our theoretical understanding of MDS remains limited as its objective function is highly non-convex. In this paper, we prove that minimizing the Kamada-Kawai objective is NP-hard and give a provable approximation algorithm for optimizing it, which in particular is a PTAS on low-diameter graphs. We supplement this result with experiments suggesting possible connections between our greedy approximation algorithm and gradient-based methods.
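For reference, the Kamada-Kawai objective discussed above is the weighted stress of an embedding $x_1,\dots,x_n \in \mathbb{R}^k$, with $d_{ij}$ the graph-theoretic distance between vertices $i$ and $j$ and, in the usual convention, weights $w_{ij} = d_{ij}^{-2}$:
\[
E(x_1,\dots,x_n) \;=\; \sum_{i<j} w_{ij}\,\bigl(\lVert x_i - x_j \rVert - d_{ij}\bigr)^2 .
\]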
We propose a new approach to defining programming languages with effects, and to proving observational equivalence. Operational machinery is given by a hypergraph-rewriting abstract machine inspired by Girard's Geometry of Interaction. It interprets a graph calculus of only three intrinsic constructs: variable binding, atom binding, and thunking. Everything else, including features which are commonly thought of as intrinsic, such as arithmetic or function abstraction and application, must be provided as extrinsic operations, with associated rewrite rules. The graph representation naturally leads to two new principles of equational reasoning, which we call \emph{locality} and \emph{robustness}. These concepts enable a novel, flexible, and powerful methodology for reasoning about (type-free) languages with effects. The methodology is additionally capable of proving a generalised notion of observational equivalence that can be quantified over syntactically restricted contexts instead of all contexts, and can also be quantitatively constrained in terms of the number of reduction steps. We illustrate the methodology using the call-by-value lambda-calculus extended with (higher-order) state.
The Bayesian decision-theoretic approach to design of experiments involves specifying a design (values of all controllable variables) to maximise the expected utility function (expectation with respect to the distribution of responses and parameters). For most common utility functions, the expected utility is rarely available in closed form and requires a computationally expensive approximation which then needs to be maximised over the space of all possible designs. This hinders the practical use of the Bayesian approach for finding experimental designs. Recently, however, a new utility called the Fisher information gain has been proposed. The resulting expected Fisher information gain reduces to the prior expectation of the trace of the Fisher information matrix. Since the Fisher information is often available in closed form, this significantly simplifies approximation and subsequent identification of optimal designs. In this paper, it is shown that for exponential family models, maximising the expected Fisher information gain is equivalent to maximising an alternative objective function over a reduced-dimension space, further simplifying the identification of optimal designs. However, if this function does not have enough global maxima, then designs that maximise the expected Fisher information gain lead to non-identifiability.
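In the notation assumed here, with $\pi(\theta)$ the prior and $\mathcal{I}(\theta; d)$ the Fisher information matrix at design $d$, the objective described above is
\[
U_{\mathrm{FIG}}(d) \;=\; \mathbb{E}_{\pi(\theta)}\!\left[\operatorname{tr}\,\mathcal{I}(\theta; d)\right] \;=\; \int \operatorname{tr}\,\mathcal{I}(\theta; d)\,\pi(\theta)\,\mathrm{d}\theta ,
\]
so that only a prior expectation of a (typically closed-form) quantity needs to be approximated.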
Stochastic gradient methods have enabled variational inference for high-dimensional models and large data sets. However, the direction of steepest ascent in the parameter space of a statistical model is given not by the commonly used Euclidean gradient, but by the natural gradient, which premultiplies the Euclidean gradient by the inverse of the Fisher information matrix. Use of natural gradients in optimization can improve convergence significantly, but inverting the Fisher information matrix is daunting in high dimensions. The contribution of this article is twofold. First, we derive the natural gradient updates of a Gaussian variational approximation in terms of the mean and Cholesky factor of the covariance matrix, and show that these updates depend only on the first derivative of the variational objective function. Second, we derive complete natural gradient updates for structured variational approximations with a minimal conditional exponential family representation, which include highly flexible mixtures of exponential family distributions that can fit skewed or multimodal posteriors. These updates, albeit more complex than those presented previously, account fully for the dependence between the mixing distribution and the distributions of the components. Further experiments will be carried out to evaluate the performance of the proposed methods.
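The natural-gradient step referred to above takes the generic form below, where $F(\lambda)$ is the Fisher information of the variational family, $\mathcal{L}$ the variational objective, and $\rho_t$ a step size; the paper's specific updates in terms of the mean and Cholesky factor are not reproduced here.
\[
\tilde{\nabla}_{\lambda}\mathcal{L} \;=\; F(\lambda)^{-1}\,\nabla_{\lambda}\mathcal{L},
\qquad
\lambda^{(t+1)} \;=\; \lambda^{(t)} + \rho_t\,\tilde{\nabla}_{\lambda}\mathcal{L}\bigl(\lambda^{(t)}\bigr).
\]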
Heat Equation Driven Area Coverage (HEDAC) is a state-of-the-art multi-agent ergodic motion control method guided by the gradient of a potential field. A finite element method is implemented here to obtain a solution of the Helmholtz partial differential equation, which models the potential field for surveying motion control. This allows us to survey arbitrarily shaped domains and to include obstacles in an elegant and robust manner intrinsic to HEDAC's fundamental idea. For simple kinematic motion, the obstacle and boundary avoidance constraints are successfully handled by directing the agent motion with the gradient of the potential. However, including additional constraints, such as a minimal clearance distance from stationary and moving obstacles and a minimal path curvature radius, requires further alterations of the control algorithm. We introduce a relatively simple yet robust approach for handling these constraints by formulating a straightforward optimization problem based on collision-free escape-route maneuvers. This approach provides a guaranteed collision avoidance mechanism while being computationally inexpensive, as a result of partitioning the optimization problem. The proposed motion control is evaluated in simulations of three realistic surveying scenarios, showing the effectiveness of the surveying and the robustness of the control algorithm. Furthermore, potential maneuvering difficulties due to improperly defined surveying scenarios are highlighted, and we provide guidelines on how to overcome them. The results are promising and indicate the real-world applicability of the proposed constrained multi-agent motion control for autonomous surveying and, potentially, other HEDAC utilizations.
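For orientation, the inhomogeneous Helmholtz equation in its generic form reads
\[
\nabla^2 u(\mathbf{x}) + k^2\, u(\mathbf{x}) \;=\; -f(\mathbf{x}), \qquad \mathbf{x} \in \Omega,
\]
subject to suitable boundary conditions on $\partial\Omega$; the specific coefficients, source term, and boundary conditions used to model the HEDAC potential field are those of the paper and are not restated here.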
Optimal transport distances have found many applications in machine learning for their capacity to compare non-parametric probability distributions. Yet their algorithmic complexity generally prevents their direct use on large scale datasets. Among the possible strategies to alleviate this issue, practitioners can rely on computing estimates of these distances over subsets of data, {\em i.e.} minibatches. While computationally appealing, we highlight in this paper some limits of this strategy, arguing it can lead to undesirable smoothing effects. As an alternative, we suggest that the same minibatch strategy coupled with unbalanced optimal transport can yield more robust behavior. We discuss the associated theoretical properties, such as unbiased estimators, existence of gradients and concentration bounds. Our experimental study shows that in challenging problems associated to domain adaptation, the use of unbalanced optimal transport leads to significantly better results, competing with or surpassing recent baselines.
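The minibatch strategy referred to above can be sketched as follows: average entropically regularised OT costs over randomly drawn pairs of minibatches. This is an illustrative balanced-Sinkhorn sketch with uniform weights; the paper's unbalanced variant additionally relaxes the marginal constraints and is not reproduced here.
\begin{verbatim}
import numpy as np

def sinkhorn_cost(X, Y, reg=0.1, n_iter=200):
    # Entropic OT cost between two point clouds with uniform weights.
    # (Plain Sinkhorn; for small reg, log-domain updates are preferable.)
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** 2
    C = C / C.max()                       # rescale cost for numerical stability
    K = np.exp(-C / reg)
    a, b = np.full(len(X), 1.0 / len(X)), np.full(len(Y), 1.0 / len(Y))
    u, v = np.ones(len(X)), np.ones(len(Y))
    for _ in range(n_iter):               # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]       # transport plan
    return float(np.sum(P * C))

def minibatch_ot(X, Y, batch_size=64, n_batches=50, rng=None):
    # Average OT cost over random minibatch pairs.
    rng = np.random.default_rng() if rng is None else rng
    return float(np.mean([
        sinkhorn_cost(X[rng.choice(len(X), batch_size, replace=False)],
                      Y[rng.choice(len(Y), batch_size, replace=False)])
        for _ in range(n_batches)]))
\end{verbatim}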
Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for various NLP tasks by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers' performance declines are substantially smaller. Pretrained Transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.
This paper addresses the problem of formally verifying desirable properties of neural networks, i.e., obtaining provable guarantees that neural networks satisfy specifications relating their inputs and outputs (robustness to bounded-norm adversarial perturbations, for example). Most previous work on this topic was limited in its applicability by the size of the network, the network architecture, and the complexity of the properties to be verified. In contrast, our framework applies to a general class of activation functions and specifications on neural network inputs and outputs. We formulate verification as an optimization problem (seeking to find the largest violation of the specification) and solve a Lagrangian relaxation of the optimization problem to obtain an upper bound on the worst-case violation of the specification being verified. Our approach is anytime, i.e., it can be stopped at any time and a valid bound on the maximum violation can be obtained. We develop specialized verification algorithms with provable tightness guarantees under special assumptions and demonstrate the practical significance of our general verification approach on a variety of verification tasks.
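Schematically, with $g$ measuring the specification violation, $h(z)=0$ encoding the network constraints over variables $z$ in a relaxed set $\mathcal{Z}$, and $\lambda$ the Lagrange multipliers, weak duality yields the anytime upper bound described above (the exact constraint encoding and specification classes are those of the paper):
\[
\max_{z \in \mathcal{Z},\, h(z)=0} g(z)
\;\;\le\;\;
\max_{z \in \mathcal{Z}} \Bigl[\, g(z) + \lambda^{\top} h(z) \,\Bigr]
\quad \text{for every } \lambda .
\]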
We consider the task of learning the parameters of a {\em single} component of a mixture model, for the case when we are given {\em side information} about that component; we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than that of solving the overall original problem, where one learns the parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy and improved computational complexity than existing moment-based mixture model algorithms (e.g., tensor methods). We also illustrate several natural ways one can obtain such side information for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms, showing significant improvements in runtime and accuracy.