Considerable debate has been generated in the recent literature on whether non-confounding covariates should be adjusted for in the analysis of case-control data through logistic regression, and limited theoretical results are available regarding this problem. Zhang et al. (2018) proposed a constrained maximum likelihood approach that appears more powerful than the approaches with or without adjusting for non-confounding covariates in logistic regression, but no theoretical justification was provided for this empirical finding. We provide rigorous justification for the relative performance of the above three approaches through Pitman's asymptotic relative efficiency. Specifically, the constrained maximum likelihood approach is proved to be uniformly most powerful. On the other hand, the relative performance of the other two approaches depends heavily on disease prevalence: adjusting for non-confounding covariates can lead to power loss when the disease prevalence is low, but this is not the case otherwise.
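The contrast between the unadjusted and covariate-adjusted analyses can be made concrete with a small simulation. The sketch below uses prospective sampling for simplicity rather than true case-control sampling, and does not implement the constrained maximum likelihood estimator of Zhang et al. (2018); variable names and the data-generating process are illustrative. It compares the Wald statistics for the exposure effect with and without the non-confounding covariate in the model.

```python
# Minimal sketch: contrast unadjusted vs covariate-adjusted logistic regression
# for an exposure effect in the presence of a non-confounding covariate.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
g = rng.binomial(1, 0.3, n)          # exposure of interest
z = rng.normal(size=n)               # non-confounding covariate (independent of g)
logit = -3.0 + 0.3 * g + 0.8 * z     # low-prevalence disease model (illustrative)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Unadjusted analysis: exposure only.
m0 = sm.Logit(y, sm.add_constant(g)).fit(disp=0)
# Adjusted analysis: exposure plus the non-confounding covariate.
m1 = sm.Logit(y, sm.add_constant(np.column_stack([g, z]))).fit(disp=0)

print("unadjusted z for exposure:", m0.tvalues[1])
print("adjusted   z for exposure:", m1.tvalues[1])
```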
Deep learning has had tremendous success at learning low-dimensional representations of high-dimensional data. This success would be impossible if there were no hidden low-dimensional structure in data of interest; this existence is posited by the manifold hypothesis, which states that the data lies on an unknown manifold of low intrinsic dimension. In this paper, we argue that this hypothesis does not properly capture the low-dimensional structure typically present in image data. Assuming that data lies on a single manifold implies that the intrinsic dimension is identical across the entire data space, and does not allow for subregions of this space to have a different number of factors of variation. To address this deficiency, we put forth the union of manifolds hypothesis, which states that data lies on a disjoint union of manifolds of varying intrinsic dimensions. We empirically verify this hypothesis on commonly used image datasets, finding that observed data indeed lies on a disconnected set and that intrinsic dimension is not constant. We also provide insights into the implications of the union of manifolds hypothesis for deep learning, both supervised and unsupervised, showing that designing models with an inductive bias for this structure improves performance across classification and generative modelling tasks.
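A minimal sketch of the kind of check involved, assuming a Levina-Bickel style MLE of intrinsic dimension from k-nearest-neighbor distances: estimate the dimension separately on two disjoint components of a synthetic dataset and observe that it differs. The data and estimator settings are illustrative and do not reproduce the paper's experiments on image datasets.

```python
# Estimate intrinsic dimension per component of a synthetic "union of manifolds".
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dim(x, k=10):
    """Average Levina-Bickel MLE of intrinsic dimension with k neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x)
    dist, _ = nn.kneighbors(x)           # column 0 is the point itself
    dist = dist[:, 1:]                   # drop the self-distance
    logratio = np.log(dist[:, -1:] / dist[:, :-1])
    inv_dims = logratio.mean(axis=1)     # 1 / m_hat per point
    return 1.0 / inv_dims.mean()

rng = np.random.default_rng(0)
ambient = 20
# Component A: a 2-dimensional linear manifold embedded in 20 dimensions.
a = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, ambient))
# Component B: a 5-dimensional linear manifold, shifted so the union is disconnected.
b = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, ambient)) + 50.0

print("estimated dim, component A:", mle_intrinsic_dim(a))
print("estimated dim, component B:", mle_intrinsic_dim(b))
```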
Paired comparison models are used for analyzing data that involves pairwise comparisons among a set of objects. When the outcomes of the pairwise comparisons have no ties, paired comparison models can be generalized as a class of binary response models. Receiver operating characteristic (ROC) curves and their corresponding areas under the curves are commonly used as performance metrics to evaluate the discriminating ability of binary response models. Despite their individually wide usage and their close connection to binary response models, ROC analysis has, to our knowledge, never been extended to paired comparison models, since the problem of using different objects as the reference in paired comparison models prevents the traditional ROC approach from generating unambiguous and interpretable curves. We address this problem by proposing two novel methods to construct ROC curves for paired comparison data that provide interpretable statistics and maintain the desired asymptotic properties. The methods are then applied to and analyzed on head-to-head professional sports competition data.
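For context, a minimal sketch of the naive analysis: fit a Bradley-Terry model (a standard no-ties paired comparison model) as a logistic regression and score its win predictions with a conventional ROC/AUC. The resulting curve depends on which object of each pair is treated as the reference, which reflects the ambiguity described above; the paper's two proposed ROC constructions are not reproduced here, and the data below is synthetic.

```python
# Naive ROC analysis of a Bradley-Terry paired comparison fit.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_obj, n_games = 8, 2000
strength = rng.normal(size=n_obj)

i = rng.integers(0, n_obj, n_games)
j = rng.integers(0, n_obj, n_games)
keep = i != j
i, j = i[keep], j[keep]
y = rng.binomial(1, 1 / (1 + np.exp(-(strength[i] - strength[j]))))

# Design matrix: +1 for the first object, -1 for the second; the last object's
# column is dropped as the reference so the model is identifiable.
X = np.zeros((len(y), n_obj))
X[np.arange(len(y)), i] += 1.0
X[np.arange(len(y)), j] -= 1.0
X = X[:, :-1]

fit = sm.Logit(y, X).fit(disp=0)
auc = roc_auc_score(y, fit.predict(X))
print("in-sample AUC of the Bradley-Terry fit:", round(auc, 3))
```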
Fine-tuning pre-trained models has proven to be effective in a wide range of NLP tasks. However, fine-tuning the whole model is parameter-inefficient, as it always yields an entirely new model for each task. Currently, many works propose to fine-tune only a small portion of the parameters while keeping most of the parameters shared across different tasks. These methods achieve surprisingly good performance and are shown to be more stable than their fully fine-tuned counterparts. However, such methods are still not well understood. Some natural questions arise: How does the parameter sparsity lead to promising performance? Why is the model more stable than the fully fine-tuned models? How should one choose the tunable parameters? In this paper, we first categorize the existing methods into random approaches, rule-based approaches, and projection-based approaches based on how they choose which parameters to tune. Then, we show that all of these methods are in fact sparse fine-tuned models and conduct a novel theoretical analysis of them. We show that the sparsity effectively imposes a regularization on the original model by controlling an upper bound on the stability. Such stability leads to better generalization, which has been empirically observed in many recent works. Although our theory grounds the effectiveness of sparsity, how to choose the tunable parameters remains an open problem. To better choose the tunable parameters, we propose a novel Second-order Approximation Method (SAM), which approximates the original problem with an analytically solvable optimization function. The tunable parameters are determined by directly optimizing the approximation function. Experimental results show that our proposed SAM model outperforms many strong baselines and also corroborates our theoretical analysis.
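A minimal PyTorch sketch of the general sparse fine-tuning recipe analyzed here: freeze the backbone and mark only a small, rule-based subset of parameters (bias terms, in this illustration) as trainable. The model, rule, and hyperparameters are placeholders; SAM's second-order selection of tunable parameters is not reproduced.

```python
# Sparse fine-tuning: train only a small subset of parameters.
import torch
import torch.nn as nn

model = nn.Sequential(              # stand-in for a pre-trained backbone
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 2),
)

for name, p in model.named_parameters():
    p.requires_grad = name.endswith("bias")   # rule-based sparse choice

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

x, y = torch.randn(32, 128), torch.randint(0, 2, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

n_total = sum(p.numel() for p in model.parameters())
n_tuned = sum(p.numel() for p in trainable)
print(f"tuning {n_tuned}/{n_total} parameters")
```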
Evolutionary algorithms (EAs) are a class of nature-inspired metaheuristics with wide applications in various practical optimization problems. In these problems, objective evaluations are usually inaccurate, because noise is almost inevitable in the real world, and weakening the negative effect of noise is a crucial issue. Sampling is a popular strategy that evaluates the objective several times and employs the mean of these evaluation results as an estimate of the objective value. In this work, we introduce a novel sampling method, median sampling, into EAs, and illustrate its properties and usefulness theoretically by solving OneMax, the problem of maximizing the number of 1s in a bit string. Instead of the mean, median sampling employs the median of the evaluation results as an estimate. Through rigorous theoretical analysis on OneMax under the commonly used one-bit noise, we show that median sampling reduces the expected runtime exponentially. Next, through two special noise models, we show that when the 2-quantile of the noisy fitness increases with the true fitness, median sampling can be better than mean sampling; otherwise, it may fail and mean sampling can be better. The results may guide us to employ median sampling properly in practical applications.
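A minimal sketch of the setting: a (1+1)-EA on OneMax where each evaluation is corrupted by one-bit noise, estimating fitness either by the median or the mean of a few samples. The noise probability, sample size, and iteration budget are illustrative, and the sketch reports the final true fitness rather than reproducing the paper's runtime analysis.

```python
# (1+1)-EA on OneMax under one-bit noise, with median vs mean sampling.
import numpy as np

rng = np.random.default_rng(0)

def noisy_onemax(x, p_noise=0.5):
    """One-bit noise: with prob. p_noise, flip one uniformly random bit
    before counting the number of 1s."""
    y = x.copy()
    if rng.random() < p_noise:
        y[rng.integers(len(y))] ^= 1
    return y.sum()

def estimate(x, samples=5, use_median=True):
    vals = [noisy_onemax(x) for _ in range(samples)]
    return np.median(vals) if use_median else np.mean(vals)

def run(n=50, iters=30_000, use_median=True, samples=5):
    x = rng.integers(0, 2, n)
    fx = estimate(x, samples, use_median)
    for _ in range(iters):
        child = x.copy()
        flips = rng.random(n) < 1.0 / n        # standard bit-wise mutation
        child[flips] ^= 1
        fc = estimate(child, samples, use_median)
        if fc >= fx:
            x, fx = child, fc
    return x.sum()                              # true (noise-free) OneMax value

print("true fitness after 30k iters, median sampling:", run(use_median=True))
print("true fitness after 30k iters, mean   sampling:", run(use_median=False))
```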
All-gather collective communication is one of the most important communication primitives in parallel and distributed computation, and it plays an essential role in many HPC applications such as distributed Deep Learning (DL) with model and hybrid parallelism. To address the communication bottleneck of All-gather, optical interconnection networks can provide unprecedented bandwidth and reliability for data transfer among distributed nodes. However, most traditional All-gather algorithms are designed for electrical interconnects and do not fit optical interconnect systems well, resulting in poor performance. This paper proposes an efficient scheme, called OpTree, for the All-gather operation on optical interconnect systems. OpTree derives an optimal $m$-ary tree corresponding to the optimal number of communication stages, achieving minimum communication time. We further analyze and compare the communication steps of OpTree with those of existing All-gather algorithms. Theoretical results show that OpTree requires far fewer communication steps than existing All-gather algorithms on optical interconnect systems. Simulation results show that OpTree reduces communication time by 72.21%, 94.30%, and 88.58%, respectively, compared with three existing All-gather schemes, WRHT, Ring, and NE.
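To make the step-count comparison concrete, here is a back-of-the-envelope sketch under a simplified model in which a node can reach its $m-1$ partners of a stage in a single step, as a high-radix optical interconnect permits. This is an illustrative count only; it is not OpTree's cost model, derivation, or choice of optimal $m$.

```python
# Rough communication-step counts: ring all-gather vs an m-ary-tree scheme.
import math

def ring_steps(p):
    return p - 1                      # classical ring all-gather

def mary_tree_steps(p, m):
    return math.ceil(math.log(p, m))  # stages of a radix-m exchange

p = 1024
for m in (2, 4, 8, 16):
    print(f"m={m:2d}: {mary_tree_steps(p, m)} steps  (ring: {ring_steps(p)})")
```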
In geostatistical problems with massive sample sizes, Gaussian processes (GPs) can be approximated using sparse directed acyclic graphs to achieve scalable $O(n)$ computational complexity. In these models, data at each location are typically assumed conditionally dependent on a small set of parents, which usually includes a subset of the nearest neighbors. These methodologies often exhibit excellent empirical performance, but the lack of theoretical validation leads to unclear guidance in specifying the underlying graphical model and may result in sensitivity to graph choice. We address these issues by introducing radial neighbors Gaussian processes and corresponding theoretical guarantees. We propose to approximate GPs using a sparse directed acyclic graph in which a directed edge connects every location to all of its neighbors within a predetermined radius. Using this construction, we show that one can accurately approximate a Gaussian process in Wasserstein-2 distance, with an error rate determined by the approximation radius, the spatial covariance function, and the spatial dispersion of samples. Our method is also insensitive to the specific choice of graphical model. We offer further empirical validation of our approach via applications to simulated and real-world data, showing state-of-the-art performance in posterior inference of spatial random effects.
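A minimal sketch of building the radius-based DAG described above: after ordering the locations, each one is connected to every earlier location within a fixed radius. The ordering, the radius, and the downstream GP approximation built on these parent sets are illustrative here.

```python
# Build parent sets for a radius-based DAG over spatial locations.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
locs = rng.uniform(size=(2000, 2))
order = np.argsort(locs[:, 0] + locs[:, 1])   # a simple coordinate ordering
locs = locs[order]
radius = 0.05

tree = cKDTree(locs)
parents = []
for i in range(len(locs)):
    nbrs = tree.query_ball_point(locs[i], r=radius)
    parents.append([j for j in nbrs if j < i])  # keep only earlier locations

sizes = [len(p) for p in parents]
print("average parent-set size:", np.mean(sizes), "max:", np.max(sizes))
```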
Simulation-based Bayesian inference (SBI) can be used to estimate the parameters of complex mechanistic models given observed model outputs, without requiring access to explicit likelihood evaluations. A prime example of the application of SBI in neuroscience is estimating the parameters governing the response dynamics of Hodgkin-Huxley (HH) models from electrophysiological measurements, by inferring a posterior over the parameters that is consistent with a set of observations. To this end, many SBI methods employ a set of summary statistics or scientifically interpretable features to estimate a surrogate likelihood or posterior. However, there is currently no way to identify how much each summary statistic or feature contributes to reducing posterior uncertainty. To address this challenge, one could simply compare the posteriors with and without a given feature included in the inference process. However, for large or nested feature sets, this would necessitate repeatedly estimating the posterior, which is computationally expensive or even prohibitive. Here, we provide a more efficient approach based on the SBI method neural likelihood estimation (NLE): we show that one can marginalize the trained surrogate likelihood post hoc, before inferring the posterior, to assess the contribution of a feature. We demonstrate the usefulness of our method by identifying the most important features for inferring the parameters of an example HH neuron model. Beyond neuroscience, our method is generally applicable to SBI workflows in other scientific fields that rely on data features for inference.
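A minimal sketch of post-hoc marginalization of a surrogate likelihood. For illustration the surrogate over summary features is taken to be Gaussian at a fixed parameter value, where marginalizing out a feature amounts to dropping its row and column; the flow-based surrogates typically used in NLE and the paper's exact procedure are not reproduced, and the numbers below are hypothetical.

```python
# Marginalize a (Gaussian) surrogate likelihood over a dropped feature.
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical Gaussian surrogate over 3 summary features at a fixed theta.
mu = np.array([1.0, -0.5, 2.0])
cov = np.array([[1.0, 0.3, 0.1],
                [0.3, 2.0, 0.4],
                [0.1, 0.4, 0.5]])

x_obs = np.array([1.2, -0.1, 1.8])

keep = [0, 2]                        # drop feature 1 to assess its contribution
full_ll = multivariate_normal(mu, cov).logpdf(x_obs)
marg_ll = multivariate_normal(mu[keep], cov[np.ix_(keep, keep)]).logpdf(x_obs[keep])
print("full-feature log-likelihood :", round(full_ll, 3))
print("marginalized log-likelihood :", round(marg_ll, 3))
```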
This article explores which parameters of the repeated Prisoner's Dilemma lead to cooperation. Using simulations, I demonstrate that the potential function of the stochastic evolutionary dynamics of the Grim Trigger strategy is useful for predicting cooperation between Q-learners. The frontier separating the parameter regions that induce either cooperation or defection can be determined from the kinetic energy exerted by the respective basins of attraction. When the incentive compatibility constraint of the Grim Trigger strategy is slack, a sudden increase in the observed cooperation rates occurs when the ratio of the kinetic energies approaches a critical value, which is itself a function of the discount factor, multiplied by a correction factor that accounts for the effect of the algorithms' exploration probability. Using metadata from laboratory experiments, I provide evidence that the insights obtained from the simulations also help explain the emergence of cooperation between humans. The observed cooperation rates show a positive gradient at the frontier characterized by an exploration probability of approximately five percent. In the context of human-to-human interaction, the exploration probability can be viewed as the belief about the opponent's probability of deviating from the equilibrium action.
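A minimal sketch of the simulation setting: two independent epsilon-greedy Q-learners play the repeated Prisoner's Dilemma with memory-one states (the last joint action), and the realized cooperation rate is recorded. Payoffs, learning rate, discount factor, and exploration probability are illustrative; the potential-function analysis described above is not reproduced.

```python
# Two independent Q-learners in the repeated Prisoner's Dilemma.
import numpy as np

rng = np.random.default_rng(0)
C, D = 0, 1
payoff = {(C, C): (3, 3), (C, D): (0, 4), (D, C): (4, 0), (D, D): (1, 1)}
alpha, gamma, eps, T = 0.1, 0.95, 0.05, 200_000

Q = [np.zeros((4, 2)), np.zeros((4, 2))]   # one Q-table per player: 4 states x 2 actions
state = 0                                   # start in state (C, C)
coop = 0

for t in range(T):
    acts = []
    for i in (0, 1):
        if rng.random() < eps:
            acts.append(int(rng.integers(2)))       # explore
        else:
            acts.append(int(np.argmax(Q[i][state])))  # exploit
    r = payoff[(acts[0], acts[1])]
    next_state = 2 * acts[0] + acts[1]
    for i in (0, 1):
        target = r[i] + gamma * Q[i][next_state].max()
        Q[i][state, acts[i]] += alpha * (target - Q[i][state, acts[i]])
    coop += (acts[0] == C and acts[1] == C)
    state = next_state

print("cooperation rate:", coop / T)
```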
We consider performing simulation experiments in the presence of covariates. Here, covariates refer to input information to the simulation model, other than the system designs, that can also affect the system performance. To make decisions, decision makers need to know the covariate values of the problem. Traditionally in simulation-based decision making, simulation samples are collected after the covariate values are known; in contrast, as a new framework, simulation with covariates starts the simulation before the covariate values are revealed and collects samples on covariate values that might appear later. Then, when the covariate values are revealed, the collected simulation samples are directly used to predict the desired results. This framework significantly reduces the decision time compared to the traditional simulation workflow. In this paper, we follow this framework and suppose there are a finite number of system designs. We adopt the metamodel of stochastic kriging (SK) and use it to predict the system performance of each design and the best design. The goal is to study how fast the prediction errors diminish with the number of covariate points sampled. This is a fundamental problem in simulation with covariates and helps quantify the relationship between the offline simulation effort and the online prediction accuracy. In particular, we adopt the maximal integrated mean squared error (IMSE) and the integrated probability of false selection (IPFS) as measures of error for the system performance and best design predictions. We then establish convergence rates for the two measures under mild conditions. Finally, these convergence behaviors are illustrated numerically using test examples.
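A minimal sketch of the offline-simulation / online-prediction workflow: for each design, a GP metamodel (used here as a simple stand-in for stochastic kriging) is fit offline over sampled covariate values; when the covariate is revealed, the metamodels predict each design's performance and the best design is selected. The response surfaces, noise level, and kernels are illustrative.

```python
# Offline metamodel fitting and online best-design prediction with covariates.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
designs = [lambda x: np.sin(3 * x), lambda x: 0.8 * np.cos(2 * x) + 0.1 * x]

# Offline stage: simulate each design at m covariate points with noise.
m = 40
x_train = rng.uniform(0, 3, size=(m, 1))
models = []
for f in designs:
    y_train = f(x_train[:, 0]) + 0.1 * rng.normal(size=m)
    gp = GaussianProcessRegressor(RBF(1.0) + WhiteKernel(0.01), normalize_y=True)
    models.append(gp.fit(x_train, y_train))

# Online stage: the covariate value is revealed; predict and select.
x_new = np.array([[1.7]])
preds = [gp.predict(x_new)[0] for gp in models]
print("predicted performances:", np.round(preds, 3), "-> best design:", int(np.argmax(preds)))
```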
Offline reinforcement learning, which aims at optimizing sequential decision-making strategies with historical data, has been extensively applied in real-life applications. State-of-the-art algorithms usually leverage powerful function approximators (e.g., neural networks) to alleviate the sample complexity hurdle for better empirical performance. Despite these successes, a more systematic understanding of the statistical complexity of function approximation remains lacking. Toward bridging this gap, we consider offline reinforcement learning with differentiable function class approximation (DFA). This function class naturally incorporates a wide range of models with nonlinear/nonconvex structures. Most importantly, we show that offline RL with differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning (PFQL) algorithm, and our results provide the theoretical basis for understanding a variety of practical heuristics that rely on fitted Q-iteration style designs. In addition, we further improve our guarantee with a tighter instance-dependent characterization. We hope our work will draw interest to the study of reinforcement learning with differentiable function approximation beyond the scope of current research.
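A minimal sketch of fitted Q-iteration with a differentiable function class (a small neural network) on a fixed offline dataset. The environment, dataset, and hyperparameters are synthetic, and the pessimism penalty that distinguishes PFQL from vanilla fitted Q-iteration is omitted for brevity.

```python
# Fitted Q-iteration with a neural network on a synthetic offline dataset.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, state_dim, n_actions, gamma = 5000, 4, 3, 0.95

# Synthetic offline dataset of transitions (s, a, r, s').
s = torch.randn(n, state_dim)
a = torch.randint(0, n_actions, (n,))
r = torch.randn(n)
s_next = torch.randn(n, state_dim)

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

for it in range(20):                           # outer fitted-Q iterations
    with torch.no_grad():                      # frozen target from the last iterate
        target = r + gamma * q_net(s_next).max(dim=1).values
    for _ in range(200):                       # regress Q(s, a) onto the target
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, target)
        opt.zero_grad()
        loss.backward()
        opt.step()

print("final Bellman regression loss:", float(loss))
```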