We compute bias and variance of an efficiency estimator for a random processes, in which the success probability is constant but the number of trials is drawn from a Poisson distribution. The standard estimator, although being a non-linear function in this case, is unbiased. Compared to the case where the number of trials is fixed, the variance is increased or decreased depending on the expected number of trials. We further compute the variance for the case where the numbers of successes and failures have a variance which exceeds that of a Poisson process. This is the case, for example, when these numbers are obtained from a mixture of signal and background events in which the background is subtracted imperfectly. We compute generalised Wilson intervals based on these variances and study their coverage probability. We conclude that the standard Wilson interval is also suitable when the number of trials is Poisson distributed.
Punishing those who refuse to participate in common efforts is a known and intensively studied way to maintain cooperation among self-interested agents. But this act is costly, hence punishers who are generally also engaged in the original joint venture, become vulnerable, which jeopardizes the effectiveness of this incentive. As an alternative, we may hire special players, whose only duty is to watch the population and punish defectors. Such a policelike or mercenary punishment can be maintained by a tax-based fund. If this tax is negligible, a cyclic dominance may emerge among different strategies. When this tax is relevant then this solution disappears. In the latter case, the fine level becomes a significant factor that determines whether punisher players coexist with cooperators or alternatively with defectors. The maximal average outcome can be reached at an intermediate cost value of punishment. Our observations highlight that we should take special care when such kind of punishment and accompanying tax are introduced to reach a collective goal.
We consider Broyden's method and some accelerated schemes for nonlinear equations having a strongly regular singularity of first order with a one-dimensional nullspace. Our two main results are as follows. First, we show that the use of a preceding Newton--like step ensures convergence for starting points in a starlike domain with density 1. This extends the domain of convergence of these methods significantly. Second, we establish that the matrix updates of Broyden's method converge q-linearly with the same asymptotic factor as the iterates. This contributes to the long--standing question whether the Broyden matrices converge by showing that this is indeed the case for the setting at hand. Furthermore, we prove that the Broyden directions violate uniform linear independence, which implies that existing results for convergence of the Broyden matrices cannot be applied. Numerical experiments of high precision confirm the enlarged domain of convergence, the q-linear convergence of the matrix updates, and the lack of uniform linear independence. In addition, they suggest that these results can be extended to singularities of higher order and that Broyden's method can converge r-linearly without converging q-linearly. The underlying code is freely available.
Regression models are used in a wide range of applications providing a powerful scientific tool for researchers from different fields. Linear, or simple parametric, models are often not sufficient to describe complex relationships between input variables and a response. Such relationships can be better described through flexible approaches such as neural networks, but this results in less interpretable models and potential overfitting. Alternatively, specific parametric nonlinear functions can be used, but the specification of such functions is in general complicated. In this paper, we introduce a flexible approach for the construction and selection of highly flexible nonlinear parametric regression models. Nonlinear features are generated hierarchically, similarly to deep learning, but have additional flexibility on the possible types of features to be considered. This flexibility, combined with variable selection, allows us to find a small set of important features and thereby more interpretable models. Within the space of possible functions, a Bayesian approach, introducing priors for functions based on their complexity, is considered. A genetically modified mode jumping Markov chain Monte Carlo algorithm is adopted to perform Bayesian inference and estimate posterior probabilities for model averaging. In various applications, we illustrate how our approach is used to obtain meaningful nonlinear models. Additionally, we compare its predictive performance with several machine learning algorithms.
Interpretability is becoming increasingly important for predictive model analysis. Unfortunately, as remarked by many authors, there is still no consensus regarding this notion. The goal of this paper is to propose the definition of a score that allows to quickly compare interpretable algorithms. This definition consists of three terms, each one being quantitatively measured with a simple formula: predictivity, stability and simplicity. While predictivity has been extensively studied to measure the accuracy of predictive algorithms, stability is based on the Dice-Sorensen index for comparing two rule sets generated by an algorithm using two independent samples. The simplicity is based on the sum of the lengths of the rules derived from the predictive model. The proposed score is a weighted sum of the three terms mentioned above. We use this score to compare the interpretability of a set of rule-based algorithms and tree-based algorithms for the regression case and for the classification case.
Traditional quantile estimators that are based on one or two order statistics are a common way to estimate distribution quantiles based on the given samples. These estimators are robust, but their statistical efficiency is not always good enough. A more efficient alternative is the Harrell-Davis quantile estimator which uses a weighted sum of all order statistics. Whereas this approach provides more accurate estimations for the light-tailed distributions, it's not robust. To be able to customize the trade-off between statistical efficiency and robustness, we could consider a trimmed modification of the Harrell-Davis quantile estimator. In this approach, we discard order statistics with low weights according to the highest density interval of the beta distribution.
When we are faced with challenging image classification tasks, we often explain our reasoning by dissecting the image, and pointing out prototypical aspects of one class or another. The mounting evidence for each of the classes helps us make our final decision. In this work, we introduce a deep network architecture that reasons in a similar way: the network dissects the image by finding prototypical parts, and combines evidence from the prototypes to make a final classification. The model thus reasons in a way that is qualitatively similar to the way ornithologists, physicians, geologists, architects, and others would explain to people on how to solve challenging image classification tasks. The network uses only image-level labels for training, meaning that there are no labels for parts of images. We demonstrate our method on the CUB-200-2011 dataset and the CBIS-DDSM dataset. Our experiments show that our interpretable network can achieve comparable accuracy with its analogous standard non-interpretable counterpart as well as other interpretable deep models.
Efficient exploration remains a major challenge for reinforcement learning. One reason is that the variability of the returns often depends on the current state and action, and is therefore heteroscedastic. Classical exploration strategies such as upper confidence bound algorithms and Thompson sampling fail to appropriately account for heteroscedasticity, even in the bandit setting. Motivated by recent findings that address this issue in bandits, we propose to use Information-Directed Sampling (IDS) for exploration in reinforcement learning. As our main contribution, we build on recent advances in distributional reinforcement learning and propose a novel, tractable approximation of IDS for deep Q-learning. The resulting exploration strategy explicitly accounts for both parametric uncertainty and heteroscedastic observation noise. We evaluate our method on Atari games and demonstrate a significant improvement over alternative approaches.
Active learning has long been a topic of study in machine learning. However, as increasingly complex and opaque models have become standard practice, the process of active learning, too, has become more opaque. There has been little investigation into interpreting what specific trends and patterns an active learning strategy may be exploring. This work expands on the Local Interpretable Model-agnostic Explanations framework (LIME) to provide explanations for active learning recommendations. We demonstrate how LIME can be used to generate locally faithful explanations for an active learning strategy, and how these explanations can be used to understand how different models and datasets explore a problem space over time. In order to quantify the per-subgroup differences in how an active learning strategy queries spatial regions, we introduce a notion of uncertainty bias (based on disparate impact) to measure the discrepancy in the confidence for a model's predictions between one subgroup and another. Using the uncertainty bias measure, we show that our query explanations accurately reflect the subgroup focus of the active learning queries, allowing for an interpretable explanation of what is being learned as points with similar sources of uncertainty have their uncertainty bias resolved. We demonstrate that this technique can be applied to track uncertainty bias over user-defined clusters or automatically generated clusters based on the source of uncertainty.
We consider the task of learning the parameters of a {\em single} component of a mixture model, for the case when we are given {\em side information} about that component, we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than solving the overall original problem, where one learns parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each one of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy, and also improved computation complexity than existing moment based mixture model algorithms (e.g. tensor methods). We also illustrate several natural ways one can obtain such side information, for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms showing significant improvement in runtime and accuracy.
Robust estimation is much more challenging in high dimensions than it is in one dimension: Most techniques either lead to intractable optimization problems or estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal, up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and suddenly make high-dimensional robust estimation a realistic possibility.