We study an important variant of the stochastic multi-armed bandit (MAB) problem that takes penalization into consideration. Instead of directly maximizing the cumulative expected reward, we need to balance the total reward against a fairness level. In this paper, we present new insights into the MAB problem and formulate it within a penalization framework, in which a rigorous notion of penalized regret can be defined and a more refined regret analysis is possible. Under this framework, we propose a hard-threshold UCB-like algorithm, which enjoys several merits, including asymptotic fairness, nearly optimal regret, and a better tradeoff between reward and fairness. Both gap-dependent and gap-independent regret bounds are established. We give a number of remarks to illustrate the soundness of our theoretical analysis, and extensive experimental results corroborate the theory and show the superiority of our method over existing alternatives.
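As a schematic illustration only: the abstract above does not spell out the algorithm, so the sketch below assumes one common notion of fairness, namely that each arm $i$ should receive at least a prescribed fraction `tau[i]` of the pulls, and combines it with standard UCB1 through a hard threshold (quota-starved arms are served first; otherwise the arm with the largest upper confidence bound is pulled). The function names and the quota interpretation are assumptions of this sketch, not the paper's method.

```python
# A schematic sketch only, not the paper's algorithm: UCB1 with a hard fairness
# threshold. If any arm is below its fairness quota tau[i] * t, pull the most-starved
# such arm; otherwise pull the arm with the largest upper confidence bound.
import numpy as np

def hard_threshold_ucb(pull_arm, n_arms, tau, horizon, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                                   # initial round-robin
        else:
            deficit = tau * t - counts                    # how far below the quota each arm is
            if deficit.max() > 0:
                arm = int(np.argmax(deficit))             # hard threshold: serve starved arms first
            else:
                ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
                arm = int(np.argmax(ucb))
        sums[arm] += pull_arm(arm, rng)
        counts[arm] += 1
    return counts, sums / counts

# Example with Bernoulli arms and a 5% fairness quota per arm.
means = [0.9, 0.5, 0.3]
counts, est = hard_threshold_ucb(lambda a, rng: rng.random() < means[a],
                                 n_arms=3, tau=np.full(3, 0.05), horizon=5000)
```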
Medical tasks are prone to inter-rater variability due to multiple factors such as image quality, professional experience and training, or guideline clarity. Training deep learning networks with annotations from multiple raters is a common practice that mitigates the model's bias towards a single expert. Reliable models that generate calibrated outputs and reflect the inter-rater disagreement are key to the integration of artificial intelligence in clinical practice. Various methods exist to take different expert labels into account. We focus on comparing three label fusion methods: STAPLE, averaging of the raters' segmentations, and random sampling of one rater's segmentation at each training iteration. Each label fusion method is studied using both the conventional training framework and the recently published SoftSeg framework, which limits information loss by treating segmentation as a regression task. Our results, across 10 data splittings on two public datasets, indicate that SoftSeg models, regardless of the ground truth fusion method, had better calibration and preservation of the inter-rater variability compared with their conventional counterparts, without impacting the segmentation performance. Conventional models, i.e., trained with a Dice loss, binary inputs, and a sigmoid/softmax final activation, were overconfident and underestimated the uncertainty associated with inter-rater variability. Conversely, fusing labels by averaging within the SoftSeg framework led to underconfident outputs and overestimation of the rater disagreement. In terms of segmentation performance, the best label fusion method differed between the two datasets studied, indicating that this choice might be task-dependent. However, SoftSeg yielded segmentation performance systematically superior or equal to that of the conventionally trained models, and had the best calibration and preservation of the inter-rater variability.
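As a minimal sketch (not the paper's implementation) of two of the three label fusion strategies compared above, the snippet below fuses a stack of binary rater masks either by averaging, which yields a soft ground truth of the kind used with SoftSeg, or by sampling one rater per training iteration; STAPLE is omitted since implementations are readily available (e.g., in SimpleITK).

```python
# A minimal sketch of two label fusion strategies for a stack of binary rater masks
# of shape (n_raters, H, W). Not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
rater_masks = rng.integers(0, 2, size=(3, 4, 4))   # 3 raters, toy 4x4 binary masks

def fuse_average(masks):
    """Soft ground truth: per-pixel fraction of raters marking the pixel."""
    return masks.mean(axis=0)

def fuse_random_sample(masks, rng):
    """Pick one rater's segmentation at random for this training iteration."""
    return masks[rng.integers(masks.shape[0])]

soft_gt = fuse_average(rater_masks)                # values in {0, 1/3, 2/3, 1}
sampled_gt = fuse_random_sample(rater_masks, rng)  # one rater's binary mask
```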
Adversarial training has been considered an imperative component for safely deploying neural network-based applications in the real world. To achieve stronger robustness, existing methods primarily focus on how to generate strong attacks, e.g., by increasing the number of update steps, regularizing the models with a smoothed loss function, and injecting randomness into the attack. Instead, we analyze the behavior of adversarial training through the lens of response frequency. We empirically find that adversarial training causes neural networks to converge slowly on high-frequency information, resulting in highly oscillatory predictions near each data point. To learn high-frequency content efficiently and effectively, we first prove that the frequency principle, i.e., the universal phenomenon that \textit{lower frequencies are learned first}, still holds in adversarial training. Based on this, we propose phase-shifted adversarial training (PhaseAT), in which the model learns high-frequency components by shifting these frequencies into the low-frequency range, where fast convergence occurs. For evaluation, we conduct experiments on CIFAR-10 and ImageNet with adaptive attacks carefully designed for reliable evaluation. Comprehensive results show that PhaseAT significantly improves convergence for high-frequency information, which in turn improves adversarial robustness by yielding smoother predictions near each data point.
Estimating hyperparameters has been a long-standing problem in machine learning. We consider the case where the task at hand is modeled as the solution to an optimization problem. Here the exact gradient with respect to the hyperparameters cannot be feasibly computed and approximate strategies are required. We introduce a unified framework for computing hypergradients that generalizes existing methods based on the implicit function theorem and automatic differentiation/backpropagation, showing that these two seemingly disparate approaches are actually tightly connected. Our framework is extremely flexible, allowing its subproblems to be solved with any suitable method, to any degree of accuracy. We derive a priori and computable a posteriori error bounds for all our methods, and numerically show that our a posteriori bounds are usually more accurate. Our numerical results also show that, surprisingly, for efficient bilevel optimization, the choice of hypergradient algorithm is at least as important as the choice of lower-level solver.
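As a concrete, simplified illustration of an implicit-function-theorem hypergradient (a sketch under the assumption of a ridge-regression lower-level problem, not the paper's general framework), the snippet below computes the gradient of a validation loss with respect to the regularization strength. The linear system with the lower-level Hessian is solved by conjugate gradients, which is the kind of subproblem that the framework above allows to be solved with any suitable method, to any degree of accuracy.

```python
# A minimal sketch of an implicit-function-theorem hypergradient for ridge regression:
#   lower level:  w*(lam) = argmin_w ||X_tr w - y_tr||^2 + lam * ||w||^2
#   upper level:  f(w*)   = ||X_val w* - y_val||^2
# Not the paper's code; the data and problem are toy assumptions.
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(50, 10)), rng.normal(size=50)
X_val, y_val = rng.normal(size=(20, 10)), rng.normal(size=20)
lam = 0.5

# Lower-level solution (closed form here only to keep the example short).
w_star = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(10), X_tr.T @ y_tr)
H = 2.0 * (X_tr.T @ X_tr + lam * np.eye(10))        # Hessian of the lower-level objective

# IFT: dw*/dlam = -H^{-1} (d^2 g / dw dlam), with the mixed derivative equal to 2 w*.
grad_f_w = 2.0 * X_val.T @ (X_val @ w_star - y_val)  # upper-level gradient in w
v, _ = cg(H, grad_f_w)                               # approximate H^{-1} grad_f_w
hypergrad = -(2.0 * w_star) @ v                      # df/dlam via the chain rule
print("approximate hypergradient:", hypergrad)
```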
We propose efficient numerical schemes for implementing natural gradient descent (NGD) for a broad range of metric spaces, with applications to PDE-based optimization problems. Our technique represents the natural gradient direction as the solution to a standard least-squares problem. Hence, instead of calculating, storing, or inverting the information matrix directly, we apply efficient methods from numerical linear algebra. We treat both scenarios where the Jacobian, i.e., the derivative of the state variable with respect to the parameter, is either explicitly known or implicitly given through constraints. We can thus reliably compute several NGD directions for large-scale parameter spaces. In particular, we are able to compute the Wasserstein NGD in thousands of dimensions, which was previously believed to be out of reach. Finally, our numerical results shed light on the qualitative differences between standard gradient descent and various NGD methods based on different metric spaces in nonconvex optimization problems.
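To illustrate the least-squares viewpoint with a deliberately simple example (assuming a Gauss-Newton/Fisher-type metric $G = J^\top J$, which need not coincide with the metrics treated in the paper): when the loss gradient takes the form $g = J^\top r$, the natural gradient direction $d$ solves $Gd = g$, i.e., the least-squares problem $\min_d \|Jd - r\|_2$, and can be obtained with an iterative solver without ever forming, storing, or inverting $G$.

```python
# A minimal sketch of the least-squares formulation under the assumption G = J^T J.
# Not the paper's code; J and r are random stand-ins for the problem data.
import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
J = rng.normal(size=(200, 30))     # Jacobian of the state w.r.t. the parameters
r = rng.normal(size=200)           # residual of the state against the data

# Solves min_d ||J d - r||_2, i.e. d = (J^T J)^{-1} J^T r, without forming J^T J.
natural_dir = lsqr(J, r, atol=1e-10, btol=1e-10)[0]

# Sanity check against the explicit normal equations (only feasible at small scale).
explicit = np.linalg.solve(J.T @ J, J.T @ r)
print(np.allclose(natural_dir, explicit))
```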
We study the competition for partners in two-sided matching markets with heterogeneous agent preferences, with a focus on how the equilibrium outcomes depend on the connectivity in the market. We model random partially connected markets, with each agent having an average degree $d$ in a random (undirected) graph and a uniformly random preference ranking over their neighbors in the graph. We formally characterize stable matchings in large random markets with small imbalance and find a threshold in the connectivity $d$ at $\log^2 n$ (where $n$ is the number of agents on one side of the market) which separates a ``weak competition'' regime, where agents on both sides of the market do equally well, from a ``strong competition'' regime, where agents on the short (long) side of the market enjoy a significant advantage (disadvantage). Numerical simulations confirm and sharpen our theoretical predictions and demonstrate robustness to our assumptions. We leverage our characterizations in two ways: First, we derive prescriptive insights into how to design the connectivity of the market to trade off optimally between the average agent welfare achieved and the number of agents who remain unmatched. For most market primitives, we find that the optimal connectivity should lie in the weak competition regime or at the threshold between the regimes. Second, our analysis uncovers a new conceptual principle governing whether the short side enjoys a significant advantage in a given matching market, which can moreover be applied as a diagnostic tool given only basic summary statistics for the market. Counterfactual analyses using data on centralized high school admissions in a major U.S. city show the practical value of both our design insights and our diagnostic principle.
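A minimal simulation sketch in the spirit of the model described above (the paper's exact experimental setup may differ): each proposer-receiver pair is connected independently with probability $d/n$, preferences are uniformly random over neighbors, and the proposer-optimal stable matching is computed by deferred acceptance. The function reports how many agents are matched, how many remain unmatched, and the average rank proposers obtain.

```python
# A toy balanced random market with average degree d and deferred acceptance.
# Not the paper's code; parameter values are arbitrary.
import numpy as np

def simulate_market(n=500, d=30, seed=0):
    rng = np.random.default_rng(seed)
    adj = rng.random((n, n)) < d / n                          # proposer i -- receiver j edge
    prop_prefs = [rng.permutation(np.flatnonzero(adj[i])) for i in range(n)]
    recv_rank = np.full((n, n), np.inf)                       # lower rank = more preferred
    for j in range(n):
        nbrs = rng.permutation(np.flatnonzero(adj[:, j]))
        recv_rank[j, nbrs] = np.arange(len(nbrs))

    next_prop = np.zeros(n, dtype=int)                        # next preference index to try
    match_of_recv = np.full(n, -1)
    free = list(range(n))
    while free:                                               # proposer-proposing deferred acceptance
        i = free.pop()
        while next_prop[i] < len(prop_prefs[i]):
            j = prop_prefs[i][next_prop[i]]
            next_prop[i] += 1
            cur = match_of_recv[j]
            if cur == -1:                                     # j was unmatched: accept
                match_of_recv[j] = i
                break
            if recv_rank[j, i] < recv_rank[j, cur]:           # j prefers i: displace cur
                match_of_recv[j] = i
                free.append(cur)
                break
    matched = match_of_recv >= 0
    prop_ranks = next_prop[match_of_recv[matched]]            # 1-indexed rank of final partner
    return matched.sum(), n - matched.sum(), prop_ranks.mean()

print(simulate_market())   # (number matched, number unmatched, avg proposer rank)
```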
Competitive balance is the subject of much interest in the sports analytics literature and beyond. In this paper, we develop a statistical network model, based on an extension of the stochastic block model, to assess the balance between teams in a league. We represent the outcomes of all matches in a football season as a dense network whose nodes are teams and whose categorical edges record the outcome of each game as a win, draw, or loss. The main focus and motivation for this paper is to provide a statistical framework for assessing competitive balance in the English First Division / Premier League over more than 40 seasons. The Premier League is arguably one of the most popular leagues in the world in terms of its global reach and the revenue it generates, so its competitiveness is of wide interest. Our analysis provides evidence of a structural change around the early 2000s, from a reasonably balanced league to a two-tier league.
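As a rough sketch only (not the paper's exact model), the log-likelihood below illustrates a stochastic block model with categorical edges: the outcome of a game between home team $i$ and away team $j$ takes one of three values, with probabilities depending only on the block memberships of the two teams.

```python
# A minimal sketch of a categorical-edge stochastic block model log-likelihood.
# Shapes and parameterization are assumptions of this example.
import numpy as np

def sbm_categorical_loglik(outcomes, z, theta):
    """outcomes[i, j] in {0, 1, 2} (home win / draw / away win) for i != j;
    z[i] is team i's block; theta[a, b] is a length-3 probability vector for a
    home team in block a hosting an away team in block b."""
    n = len(z)
    ll = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                ll += np.log(theta[z[i], z[j], outcomes[i, j]])
    return ll
```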
Stochastic kinetic models (SKMs) are increasingly used to account for the inherent stochasticity exhibited by interacting populations of species in areas such as epidemiology, population ecology and systems biology. Species numbers are modelled using a continuous-time stochastic process, and, depending on the application area of interest, this will typically take the form of a Markov jump process or an It\^o diffusion process. Widespread use of these models is typically precluded by their computational complexity. In particular, performing exact fully Bayesian inference in either modelling framework is challenging due to the intractability of the observed data likelihood, necessitating the use of computationally intensive techniques such as particle Markov chain Monte Carlo (particle MCMC). We propose to increase the computational and statistical efficiency of this approach by leveraging the tractability of an inexpensive surrogate derived directly from either the jump or diffusion process. The surrogate is used in three ways: in the design of a gradient-based parameter proposal, to construct an appropriate bridge and in the first stage of a delayed-acceptance step. We find that the resulting approach, which exactly targets the posterior of interest, offers substantial gains in efficiency over a standard particle MCMC implementation.
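As a generic illustration of one of the three uses of the surrogate described above, the sketch below shows a delayed-acceptance Metropolis-Hastings step: a cheap surrogate log-posterior screens the proposal in stage one, and only surviving proposals pay for the expensive (e.g., particle-filter) evaluation in stage two, with the two-stage acceptance probability leaving the exact target invariant. The gradient-based proposal and the bridge construction are not shown, and the random-walk proposal here is an assumption of this sketch.

```python
# A generic delayed-acceptance Metropolis-Hastings step with a symmetric random-walk
# proposal. In a pseudo-marginal/particle-filter setting, the expensive log-posterior
# estimate at the current theta should be stored and reused rather than recomputed.
import numpy as np

def delayed_acceptance_step(theta, log_post_cheap, log_post_exact, rng, step=0.1):
    prop = theta + step * rng.normal(size=theta.shape)
    # Stage 1: screen with the cheap surrogate.
    log_a1 = log_post_cheap(prop) - log_post_cheap(theta)
    if np.log(rng.random()) >= log_a1:
        return theta                                  # rejected at the cheap stage
    # Stage 2: correct with the expensive target.
    log_a2 = (log_post_exact(prop) - log_post_exact(theta)) - log_a1
    if np.log(rng.random()) < log_a2:
        return prop
    return theta
```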
The combinatorial pure exploration of causal bandits is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we choose a subset of variables to intervene on, or perform no intervention, and observe the random outcomes of all random variables, with the goal that, using as few rounds as possible, we can output an intervention that gives the best (or almost best) expected outcome on the reward variable $Y$ with probability at least $1-\delta$, where $\delta$ is a given confidence level. We provide the first gap-dependent and fully adaptive pure exploration algorithms for two types of causal models -- the binary generalized linear model (BGLM) and general graphs. For BGLM, our algorithm is the first designed specifically for this setting and achieves polynomial sample complexity, whereas all existing algorithms for general graphs either have sample complexity exponential in the graph size or rely on unreasonable assumptions. For general graphs, our algorithm provides a significant improvement in sample complexity, and it nearly matches the lower bound we prove. Our algorithms achieve this improvement by a novel integration of prior causal bandit algorithms, which utilize the rich observational feedback in causal bandits but are not adaptive to reward gaps, and prior adaptive pure exploration algorithms, which have the opposite issue.
This paper focuses on best arm identification (BAI) in stochastic multi-armed bandits (MABs) in the fixed-confidence, parametric setting. In such pure exploration problems, the accuracy of the sampling strategy critically hinges on the sequential allocation of the sampling resources among the arms. The existing approaches to BAI address the following question: what is an optimal sampling strategy when we spend a $\beta$ fraction of the samples on the best arm? These approaches treat $\beta$ as a tunable parameter and offer efficient algorithms that ensure optimality up to the choice of $\beta$, hence $\beta$-optimality. However, the BAI decisions and performance can be highly sensitive to the choice of $\beta$. This paper provides a BAI algorithm that is agnostic to $\beta$, dispensing with the need to tune $\beta$, and specifies an optimal allocation strategy, including the optimal value of $\beta$. Furthermore, whereas the existing relevant literature focuses on the family of exponential distributions, this paper considers the more general setting of an arbitrary family of distributions parameterized by their mean values (under mild regularity conditions).
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
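As a minimal sketch of the weighting scheme described above (the normalization so that the weights sum to the number of classes is an assumption of this example), the effective number of samples for a class with $n$ observations is $(1-\beta^{n})/(1-\beta)$, and the class-balanced loss weights each class by the inverse of this quantity:

```python
# A minimal sketch of class-balanced weights from per-class sample counts.
# The normalization convention is an assumption of this example.
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    samples_per_class = np.asarray(samples_per_class, dtype=float)
    effective_num = (1.0 - np.power(beta, samples_per_class)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes.
    return weights * len(samples_per_class) / weights.sum()

# Example: a long-tailed class distribution; rarer classes receive larger weights.
print(class_balanced_weights([5000, 500, 50, 5], beta=0.999))
```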