Heterogeneity is a hallmark of complex diseases. Regression-based heterogeneity analysis, which is directly concerned with outcome-feature relationships, has led to a deeper understanding of disease biology. Such an analysis identifies the underlying subgroup structure and estimates the subgroup-specific regression coefficients. However, most existing regression-based heterogeneity analyses can only accommodate disjoint subgroups; that is, each sample is assigned to exactly one subgroup. In reality, some samples carry multiple labels: many genes have several biological functions, and some cells of pure cell types transition into other types over time. The outcome-feature relationships (regression coefficients) of such samples can therefore be a mixture of the relationships in more than one subgroup, and disjoint subgrouping results can accordingly be unsatisfactory. To this end, we develop a novel approach to regression-based heterogeneity analysis that accommodates possible overlaps between subgroups as well as high data dimensions. A subgroup membership vector is introduced for each sample and incorporated into the loss function. To compensate for the limited information available from small sample sizes, an $l_2$ norm penalty is imposed on each membership vector to encourage similarity among its elements. A sparse penalty is also applied for regularized estimation and feature selection. Extensive simulations demonstrate its superiority over direct competitors. The analysis of Cancer Cell Line Encyclopedia data and lung cancer data from The Cancer Genome Atlas shows that the proposed approach can identify an overlapping subgroup structure with favorable performance in prediction and stability.
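To make the formulation concrete, the following is a minimal sketch (our illustration, not the authors' implementation) of an objective of the kind described above: per-sample membership vectors mix subgroup-specific coefficients, an $l_2$ penalty on each membership vector encourages similar elements, and a lasso-type penalty on the coefficients performs feature selection. All names and the least-squares loss are illustrative assumptions.

```python
import numpy as np

def objective(X, y, B, Pi, lam_member=0.1, lam_sparse=0.1):
    """X: (n, p) features; y: (n,) outcomes;
    B: (K, p) subgroup-specific coefficients;
    Pi: (n, K) nonnegative membership vectors, rows summing to 1."""
    # Each sample's coefficient vector is a membership-weighted mixture
    # of the subgroup-specific coefficients.
    beta_i = Pi @ B                       # (n, p)
    fits = np.sum(X * beta_i, axis=1)     # per-sample fitted values
    loss = 0.5 * np.mean((y - fits) ** 2)
    # With rows of Pi on the simplex, the l2 penalty is smallest at the
    # uniform vector, so it encourages similar elements within each row.
    member_pen = lam_member * np.sum(Pi ** 2)
    sparse_pen = lam_sparse * np.abs(B).sum()   # feature selection
    return loss + member_pen + sparse_pen
```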
Mounting compact and lightweight base stations on unmanned aerial vehicles (UAVs) is a cost-effective and flexible solution for providing seamless coverage on top of existing terrestrial networks. While the coverage probability in UAV-assisted cellular networks has been widely investigated, it provides only the first-order statistic of the signal-to-interference-plus-noise ratio (SINR). In this paper, to analyze higher-order statistics of the SINR and characterize the disparity among individual links, we provide a meta distribution (MD)-based analytical framework for UAV-assisted cellular networks, in which the probabilistic line-of-sight channel and a realistic antenna pattern are taken into account for air-to-ground transmissions. To accurately characterize the interference from UAVs, we relax the widely applied uniform off-boresight angle (OBA) assumption and derive the exact distribution of the OBA. Using stochastic geometry, for both steerable and vertical antenna scenarios, we obtain mathematical expressions for the moments of the conditional success probability, the SINR MD, and the mean local delay. Moreover, we study the asymptotic behavior of the moments as the network density approaches infinity. Numerical results validate the tightness of the theoretical results and show that the uniform OBA assumption underestimates the network performance, especially at moderate UAV altitudes. We also show that when UAVs are equipped with steerable antennas, network coverage and user fairness can be optimized simultaneously by carefully adjusting the UAV parameters.
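As a concrete illustration of what the MD adds beyond coverage, the sketch below uses the beta approximation that is standard in the meta distribution literature (an assumption on our part, not code from this paper): the first two moments $M_1, M_2$ of the conditional success probability are matched to a Beta distribution, whose complementary CDF gives the fraction of links whose conditional success probability exceeds a reliability target $x$.

```python
from scipy.stats import beta

def meta_distribution_beta(M1, M2, x):
    # Match first two moments of the conditional success probability
    # to Beta(a, b), then evaluate the complementary CDF at x.
    var = M2 - M1 ** 2
    a = M1 * (M1 * (1 - M1) / var - 1)
    b = a * (1 - M1) / M1
    return beta.sf(x, a, b)   # fraction of links more reliable than x

# Example: illustrative moment values (not results from the paper).
print(meta_distribution_beta(M1=0.8, M2=0.68, x=0.9))
```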
Prophet inequalities consist of many beautiful statements that establish tight performance ratios between online and offline allocation algorithms. Typically, tightness is established by constructing an algorithmic guarantee and a worst-case instance separately, whose bounds match as a result of some "ingenuity". In this paper, we instead formulate the construction of the worst-case instance as an optimization problem, which directly finds the tight ratio without needing to construct two bounds separately. Our analysis of this complex optimization problem involves identifying structure in a new "Type Coverage" dual problem. It can be seen as akin to the celebrated Magician and OCRS problems, except more general in that it can also provide tight ratios relative to the optimal offline allocation, whereas the earlier problems only concern the ex-ante relaxation of the offline problem. Through this analysis, our paper provides a unified framework that derives new prophet inequalities and recovers existing ones, including two important new results. First, we show that the "oblivious" method of setting a static threshold due to Chawla et al. (2020) is, surprisingly, best possible among all static threshold algorithms, for any number $k$ of units. We emphasize that this result is derived without needing to explicitly find any counterexample instances. Our result implies that the asymptotic convergence rate of $1-O(\sqrt{\log k/k})$ for static threshold algorithms, first established in Hajiaghayi et al. (2007), is tight; this confirms for the first time a separation from the convergence rate of adaptive algorithms, which is $1-\Theta(\sqrt{1/k})$ due to Alaei (2014). Second, turning to the IID setting, our framework allows us to numerically illustrate the tight guarantee (of adaptive algorithms) for any number $k$ of starting units. Our guarantees for $k>1$ exceed the state of the art.
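For intuition about the static threshold algorithms discussed above, here is a small simulation sketch (our illustration; the instance distribution and threshold value are arbitrary stand-ins, not from the paper): a $k$-unit online policy accepts any arrival above a fixed threshold while units remain, and is compared against the offline optimum, the sum of the $k$ largest values.

```python
import numpy as np

rng = np.random.default_rng(0)

def static_threshold_ratio(threshold, k=4, n=50, trials=20000):
    ratios = []
    for _ in range(trials):
        vals = rng.exponential(scale=1.0, size=n)  # i.i.d. stand-in instance
        units, online = k, 0.0
        for v in vals:                 # values arrive one by one
            if units > 0 and v >= threshold:
                online += v            # accept: above the static threshold
                units -= 1
        offline = np.sort(vals)[-k:].sum()   # offline optimum: k largest
        ratios.append(online / offline)
    return np.mean(ratios)

print(static_threshold_ratio(threshold=2.0))
```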
Heterogeneity and comorbidity are two interwoven challenges associated with various healthcare problems that have greatly hampered research on developing effective treatments and understanding the underlying neurobiological mechanisms. Very few studies have investigated heterogeneous causal effects (HCEs) in graphical contexts, owing to the lack of statistical methods. To characterize this heterogeneity, we first conceptualize heterogeneous causal graphs (HCGs) by generalizing the causal graphical model with confounder-based interactions and multiple mediators. Confounders that interact with the treatment are known as moderators. This allows us to flexibly produce HCGs given different moderators and to explicitly characterize HCEs of the treatment or potential mediators on the outcome. We establish the theoretical forms of HCEs and derive their properties at the individual level in both linear and nonlinear models. An interactive structural learning approach is developed to estimate the complex HCGs and HCEs, with confidence intervals provided. Our method is empirically justified by extensive simulations, and its practical usefulness is illustrated by exploring causality among psychiatric disorders for trauma survivors.
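To illustrate what a moderator-induced heterogeneous effect looks like, the following toy linear structural model (our construction, not the authors' method) includes a treatment $A$, a moderator $Z$ interacting with $A$, and a mediator $M$, so that the individual-level total effect of $A$ on $Y$ varies with $Z$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
Z = rng.normal(size=n)              # moderator (confounder with interaction)
A = rng.binomial(1, 0.5, size=n)    # randomized treatment
M = 1.0 * A + 0.5 * A * Z + rng.normal(size=n)           # mediator
Y = 2.0 * M + 0.8 * A - 0.3 * A * Z + Z + rng.normal(size=n)

# Individual-level total effect of A implied by this toy graph:
# direct part (0.8 - 0.3 Z) plus indirect part via M, 2.0 * (1.0 + 0.5 Z).
ite = (0.8 - 0.3 * Z) + 2.0 * (1.0 + 0.5 * Z)
print(ite.mean(), ite.std())        # effects are heterogeneous in Z
```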
High-dimensional feature selection is a central problem in a variety of application domains such as machine learning, image analysis, and genomics. In this paper, we propose graph-based tests as a useful basis for feature selection. We describe an algorithm for selecting informative features in high-dimensional data, where each observation comes from one of $K$ different distributions. Our algorithm can be applied in a completely nonparametric setup without any distributional assumptions on the data, and it aims to output those features that contribute the most to the overall distributional variation. At the heart of our method is the recursive application of distribution-free graph-based tests to subsets of the feature set, located at different depths of a hierarchical clustering tree constructed from the data. Our algorithm recovers all truly contributing features with high probability while ensuring optimal control of false discoveries. Finally, we show the superior performance of our method over existing ones on synthetic data, and also demonstrate its utility on two real-life datasets from the domains of climate change and single-cell transcriptomics.
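The recursion can be sketched as follows (an illustrative reconstruction, not the authors' code): features are organized by hierarchical clustering, and each tree node is tested with a distribution-free edge-count (Friedman-Rafsky-type) two-sample statistic on the minimum spanning tree of the samples restricted to that node's features; branches showing no distributional difference are pruned early. The permutation size and stopping rule here are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def edge_count_stat(X, labels):
    """Number of MST edges joining samples from different groups."""
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    return int(np.sum(labels[mst.row] != labels[mst.col]))

def recurse(node, X, labels, alpha, selected):
    feats = node.pre_order()              # feature indices under this node
    obs = edge_count_stat(X[:, feats], labels)
    # Permutation null for the edge count (small cross-count = evidence
    # of a distributional difference); 200 permutations for brevity.
    null = [edge_count_stat(X[:, feats], np.random.permutation(labels))
            for _ in range(200)]
    if np.mean([s <= obs for s in null]) > alpha:
        return                            # no evidence here: prune the branch
    if node.is_leaf():
        selected.append(node.id)          # a single contributing feature
    else:
        recurse(node.left, X, labels, alpha, selected)
        recurse(node.right, X, labels, alpha, selected)

# Usage: tree = to_tree(linkage(X.T, method="average")); selected = []
#        recurse(tree, X, labels, alpha=0.05, selected=selected)
```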
Most personalised federated learning (FL) approaches assume that the raw data of all clients are defined in a common subspace, i.e., all clients store their data according to the same schema. For real-world applications, this assumption is restrictive, as clients, having their own systems to collect and store data, may use heterogeneous data representations. We aim to fill this gap. To this end, we propose a general framework coined FLIC that maps each client's data onto a common feature space via local embedding functions. The common feature space is learnt in a federated manner using Wasserstein barycenters, while the local embedding functions are trained on each client via distribution alignment. We integrate this distribution alignment mechanism into a federated learning approach and provide the algorithmics of FLIC. We compare its performance against FL benchmarks involving heterogeneous input feature spaces. In addition, we provide theoretical insights supporting the relevance of our methodology.
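The core one-dimensional intuition can be sketched in a few lines (our illustration, not the FLIC code): the Wasserstein barycenter of empirical 1-D distributions is obtained by averaging quantile functions, and each client's optimal monotone alignment map sends its $i$-th smallest sample to the $i$-th smallest barycenter value.

```python
import numpy as np

def wasserstein_barycenter_1d(samples_per_client):
    # In 1-D, the barycenter averages the empirical quantile functions,
    # i.e., the sorted samples (equal sample sizes assumed for simplicity).
    sorted_samples = [np.sort(s) for s in samples_per_client]
    return np.mean(sorted_samples, axis=0)

def local_alignment_map(client_samples, barycenter):
    # Monotone map: i-th smallest client sample -> i-th smallest barycenter value.
    ranks = np.argsort(np.argsort(client_samples))
    return barycenter[ranks]

# Three clients with heterogeneous 1-D feature distributions.
clients = [np.random.default_rng(i).normal(i, 1 + i, size=100) for i in range(3)]
bary = wasserstein_barycenter_1d(clients)
aligned = [local_alignment_map(c, bary) for c in clients]
```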
The industrialization of catalytic processes is more important today than ever before, and kinetic models are essential tools for it. Kinetic models affect the design, optimization, and control of catalytic processes, but they are not easy to obtain. Classical paradigms, such as mechanistic modeling, require substantial domain knowledge, while data-driven and hybrid modeling lack interpretability. Consequently, a different approach called automated knowledge discovery has recently gained popularity. Many methods have been developed under this paradigm, with ALAMO, SINDy, and genetic programming as notable examples. However, these methods suffer from important drawbacks: they require assumptions about model structures, scale poorly, lack robust and well-founded model selection routines, and are sensitive to noise. To overcome these challenges, the present work constructs two methodological frameworks, ADoK-S and ADoK-W (Automated Discovery of Kinetics using a Strong/Weak formulation of symbolic regression), for the automated generation of catalytic kinetic models. We leverage genetic programming for model generation, a sequential optimization routine for model refinement, and a robust criterion for model selection. Both frameworks are tested against three computational case studies of increasing complexity. We showcase their ability to retrieve the underlying kinetic rate model from a limited amount of noisy data from the catalytic system, indicating a strong potential for chemical reaction engineering applications.
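The refine-and-select stage can be illustrated as follows (a sketch under our own assumptions, not the ADoK code): candidate rate laws, here a hypothetical power law and a Langmuir-type form, are refit by nonlinear least squares and compared with an information criterion; the actual frameworks use genetic programming to propose candidates and a more robust selection criterion.

```python
import numpy as np
from scipy.optimize import curve_fit

# Two illustrative candidate rate laws, each with two parameters.
candidates = {
    "power_law": lambda C, k, n: k * C ** n,
    "langmuir":  lambda C, k, K: k * K * C / (1 + K * C),
}

def select_model(C, rate, n_params=2):
    best = None
    for name, f in candidates.items():
        try:
            popt, _ = curve_fit(f, C, rate, p0=np.ones(n_params), maxfev=5000)
        except RuntimeError:
            continue                       # fit failed: skip this candidate
        rss = np.sum((rate - f(C, *popt)) ** 2)
        aic = len(C) * np.log(rss / len(C)) + 2 * n_params  # model selection
        if best is None or aic < best[1]:
            best = (name, aic, popt)
    return best

# Noisy synthetic data generated from the Langmuir form.
C = np.linspace(0.1, 2.0, 40)
rate = 1.5 * 0.8 * C / (1 + 0.8 * C) + np.random.default_rng(0).normal(0, 0.01, 40)
print(select_model(C, rate))
```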
The recent literature on online learning to rank (LTR) has established the utility of prior knowledge in Bayesian ranking bandit algorithms. However, a major limitation of existing work is the requirement that the prior used by the algorithm match the true prior. In this paper, we propose and analyze adaptive algorithms that address this issue, and we additionally extend these results to linear and generalized linear models. We also consider scalar relevance feedback on top of click feedback. Moreover, we demonstrate the efficacy of our algorithms in both synthetic and real-world experiments.
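For context, a common baseline in this literature is Thompson sampling under a cascade click model; the sketch below (our illustration, not the paper's adaptive algorithm) maintains a Beta posterior over each item's attraction probability and updates it from click feedback on ranked lists. A misspecified prior here is exactly the issue the adaptive algorithms are designed to overcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k, T = 20, 5, 2000
true_attr = rng.uniform(0.05, 0.6, n_items)      # hidden attraction probs
alpha = np.ones(n_items)                          # Beta(1, 1) priors
beta = np.ones(n_items)

for t in range(T):
    theta = rng.beta(alpha, beta)                 # posterior sample per item
    ranking = np.argsort(-theta)[:k]              # show k most promising items
    for pos, item in enumerate(ranking):          # user scans top-down
        click = rng.random() < true_attr[item]
        alpha[item] += click
        beta[item] += 1 - click
        if click:
            break                                 # cascade: stop at first click
```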
The geometric optimisation of crystal structures is a procedure widely used in chemistry that changes the geometrical placement of the particles inside a structure. It is called structural relaxation and constitutes a local minimization problem with a non-convex objective function whose domain complexity increases with the number of particles involved. In this work we study the performance of the two most popular first-order optimisation methods, Gradient Descent and Conjugate Gradient, in structural relaxation; the respective pseudocodes can be found in Section 6. Although frequently employed, these methods have received little study in this context from an algorithmic point of view. In order to define the problem accurately, we provide a thorough derivation of all necessary formulae related to the crystal structure energy function and its differentiation. We run each algorithm with a constant step size, which provides a benchmark for the methods' analysis and direct comparison. We also design dynamic step size rules and study how these improve the two algorithms' performance. Our results show a trade-off between the convergence rate and the probability that an experiment succeeds, hence we construct a function to assign utility to each method based on our respective preferences. The function is built according to a recently introduced model of preference indication concerning algorithms with deadlines and their run times. Finally, building on all the insights from our experimental results, we provide algorithmic recipes that best correspond to each of the presented preferences and select one recipe as optimal for equally weighted preferences.
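For concreteness, the following is a minimal sketch of constant-step Gradient Descent on a toy relaxation problem (our illustration; the paper's own pseudocodes are in its Section 6), using a pairwise Lennard-Jones energy as a stand-in for the crystal structure energy function.

```python
import numpy as np

def energy_and_grad(pos):
    disp = pos[:, None, :] - pos[None, :, :]       # pairwise displacements
    r = np.linalg.norm(disp, axis=-1)
    np.fill_diagonal(r, np.inf)                    # exclude self-interactions
    e = np.sum(4 * (r ** -12 - r ** -6)) / 2       # halve double-counted pairs
    coef = 4 * (-12 * r ** -14 + 6 * r ** -8)      # phi'(r) / r
    grad = np.sum(coef[:, :, None] * disp, axis=1)
    return e, grad

rng = np.random.default_rng(0)
# 8 particles on a perturbed cubic lattice near the pair-potential minimum.
grid = np.stack(np.meshgrid(*[np.arange(2) * 1.2] * 3), -1).reshape(-1, 3)
pos = grid + rng.normal(0, 0.05, grid.shape)

step = 1e-3                                        # constant step size
for it in range(500):
    e, g = energy_and_grad(pos)
    pos -= step * g                                # gradient descent update
print(e)
```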
Federated Reinforcement Learning (FedRL) encourages distributed agents to learn collectively from each other's experience to improve their performance without exchanging their raw trajectories. Existing work on FedRL assumes that all participating agents are homogeneous, which requires all agents to share the same policy parameterization (e.g., network architectures and training configurations). However, in real-world applications, agents often differ in architecture and parameterization, in part because of disparate computational budgets. Because homogeneity cannot be assumed in practice, we introduce the problem setting of Federated Reinforcement Learning with Heterogeneous And bLack-box agEnts (FedRL-HALE). We present the unique challenges this new setting poses and propose the Federated Heterogeneous Q-Learning (FedHQL) algorithm that addresses these challenges in a principled manner. We empirically demonstrate the efficacy of FedHQL in boosting the sample efficiency of heterogeneous agents with distinct policy parameterizations on standard RL tasks.
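The black-box aspect of the setting can be sketched as follows (our illustration, not the FedHQL algorithm): agents expose only action-value queries, so the server can aggregate value estimates but cannot average parameters; handling exploration and unreliable estimates on top of such aggregation is where the challenges above arise.

```python
import numpy as np

class BlackBoxAgent:
    """Heterogeneous agents: each has its own hidden Q parameterization."""
    def __init__(self, n_states, n_actions, seed):
        self.q = np.random.default_rng(seed).normal(size=(n_states, n_actions))

    def query(self, state):
        return self.q[state]          # only Q-value queries are exposed

def federated_action(agents, state):
    # Naive aggregation: average the agents' value estimates and act
    # greedily. A real algorithm must also handle exploration and
    # unreliable agents, which this sketch deliberately omits.
    q_avg = np.mean([a.query(state) for a in agents], axis=0)
    return int(np.argmax(q_avg))

agents = [BlackBoxAgent(10, 4, seed=s) for s in range(5)]
print(federated_action(agents, state=3))
```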
Scatterplots are a common tool for exploring multidimensional datasets, especially in the form of scatterplot matrices (SPLOMs). However, scatterplots suffer from overplotting when categorical variables are mapped to one or both axes, or when the same continuous variable is used for both axes. Previous remedies such as histograms or violin plots rely on aggregation, which makes brushing and linking difficult. To address this, we propose gatherplots, an extension of scatterplots that manages the overplotting problem. Gatherplots are a form of unit visualization, which avoids aggregation and maintains the identity of individual objects to ease visual perception. In gatherplots, every visual mark that maps to the same position coalesces to form a packed entity, making it easier to see an overview of data groupings. The size and aspect ratio of marks can also be changed dynamically to ease comparison of the composition of different groups. For the case of a categorical variable versus a categorical variable, we propose a heuristic for choosing bin sizes for optimal space usage. To validate our work, we conducted a crowdsourced user study, which shows that gatherplots enable people to assess data distributions more quickly and more accurately than jittered scatterplots.
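The gathering layout can be sketched as follows (our reconstruction, not the authors' implementation): marks sharing an x-category are packed into a small grid at that category's position, so each data object keeps its own mark instead of being overplotted or jittered.

```python
import numpy as np
import matplotlib.pyplot as plt

def gather_positions(n, center_x, cols=8, spacing=0.08):
    # Pack n marks into a grid centered at center_x, rows growing upward.
    idx = np.arange(n)
    x = center_x + (idx % cols - cols / 2) * spacing
    y = (idx // cols) * spacing
    return x, y

rng = np.random.default_rng(0)
categories = rng.integers(0, 3, size=200)        # a categorical variable
fig, ax = plt.subplots()
for c in range(3):
    n = int(np.sum(categories == c))
    x, y = gather_positions(n, center_x=c)
    ax.scatter(x, y, s=12)                       # one mark per data object
ax.set_xticks([0, 1, 2])
ax.set_xlabel("category")
plt.show()
```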