Sparse decision trees are one of the most common forms of interpretable models. While recent advances have produced algorithms that fully optimize sparse decision trees for prediction, that work does not address policy design, because the algorithms cannot handle weighted data samples. Specifically, they rely on the discreteness of the loss function, which means that real-valued weights cannot be directly used. For example, none of the existing techniques produce policies that incorporate inverse propensity weighting on individual data points. We present three algorithms for efficient sparse weighted decision tree optimization. The first approach directly optimizes the weighted loss function; however, it tends to be computationally inefficient for large datasets. Our second approach, which scales more efficiently, transforms weights to integer values and uses data duplication to transform the weighted decision tree optimization problem into an unweighted (but larger) counterpart. Our third algorithm, which scales to much larger datasets, uses a randomized procedure that samples each data point with a probability proportional to its weight. We present theoretical bounds on the error of the two fast methods and show experimentally that these methods can be two orders of magnitude faster than the direct optimization of the weighted loss, without losing significant accuracy.
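As a minimal sketch of the two scalable reductions described above (assuming NumPy arrays and nonnegative per-sample weights; the actual algorithms, the tree optimizer, and the error bounds are in the paper), the weighted problem can be handed to an unweighted sparse-tree optimizer either by rounding scaled weights to integers and duplicating rows, or by sampling rows with probability proportional to their weights:

```python
import numpy as np

def duplicate_by_weight(X, y, w, scale=10):
    """Round scaled weights to integers and replicate each row that many times."""
    counts = np.maximum(np.rint(w * scale).astype(int), 0)
    idx = np.repeat(np.arange(len(y)), counts)
    return X[idx], y[idx]

def sample_by_weight(X, y, w, n_samples, rng=None):
    """Draw rows with probability proportional to weight (with replacement)."""
    rng = np.random.default_rng(rng)
    p = np.asarray(w, dtype=float)
    p = p / p.sum()
    idx = rng.choice(len(y), size=n_samples, replace=True, p=p)
    return X[idx], y[idx]

# Example: inverse-propensity weights fed to an unweighted tree optimizer.
X = np.random.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)
w = np.random.uniform(0.5, 2.0, size=100)          # e.g. 1 / estimated propensity
X_dup, y_dup = duplicate_by_weight(X, y, w)
X_smp, y_smp = sample_by_weight(X, y, w, n_samples=200, rng=0)
```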
Estimating the entropy rate of discrete time series is a challenging problem with important applications in numerous areas including neuroscience, genomics, image processing and natural language processing. A number of approaches have been developed for this task, typically based either on universal data compression algorithms, or on statistical estimators of the underlying process distribution. In this work, we propose a fully-Bayesian approach for entropy estimation. Building on the recently introduced Bayesian Context Trees (BCT) framework for modelling discrete time series as variable-memory Markov chains, we show that it is possible to sample directly from the induced posterior on the entropy rate. This can be used to estimate the entire posterior distribution, providing much richer information than point estimates. We develop theoretical results for the posterior distribution of the entropy rate, including proofs of consistency and asymptotic normality. The practical utility of the method is illustrated on both simulated and real-world data, where it is found to outperform state-of-the-art alternatives.
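To make the idea of a posterior over the entropy rate concrete, the sketch below simplifies the BCT model to a first-order Markov chain with independent Dirichlet posteriors on the transition rows (an assumption for illustration only): each posterior draw of the transition matrix yields one entropy-rate value H = -Σ_i π_i Σ_j P_ij log P_ij, and the collection of draws approximates the posterior distribution rather than a single point estimate.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a row-stochastic matrix via the left eigenvector."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return pi / pi.sum()

def entropy_rate(P):
    """H = -sum_i pi_i sum_j P_ij log P_ij (in nats)."""
    pi = stationary(P)
    with np.errstate(divide="ignore", invalid="ignore"):
        logP = np.where(P > 0, np.log(P), 0.0)
    return float(-(pi[:, None] * P * logP).sum())

def posterior_entropy_samples(x, k, n_draws=1000, alpha=0.5, rng=0):
    """Dirichlet posterior over each row of a first-order transition matrix,
    then one entropy-rate value per posterior draw."""
    rng = np.random.default_rng(rng)
    counts = np.zeros((k, k)) + alpha
    for a, b in zip(x[:-1], x[1:]):
        counts[a, b] += 1
    draws = []
    for _ in range(n_draws):
        P = np.vstack([rng.dirichlet(row) for row in counts])
        draws.append(entropy_rate(P))
    return np.array(draws)

x = np.random.default_rng(1).integers(0, 3, size=2000)
samples = posterior_entropy_samples(x, k=3)
print(samples.mean(), np.quantile(samples, [0.025, 0.975]))
```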
In recent years there has been growing attention to interpretable machine learning models which can provide explanatory insight into their behavior. Thanks to their interpretability, decision trees have been intensively studied for classification tasks, and, owing to the remarkable advances in mixed-integer programming (MIP), various approaches have been proposed to formulate the problem of training an Optimal Classification Tree (OCT) as a MIP model. We present a novel mixed-integer quadratic formulation for the OCT problem, which exploits the generalization capabilities of Support Vector Machines for binary classification. Our model, denoted as Margin Optimal Classification Tree (MARGOT), encompasses the use of maximum-margin multivariate hyperplanes nested in a binary tree structure. To enhance the interpretability of our approach, we analyse two alternative versions of MARGOT which include feature selection constraints inducing local sparsity of the hyperplanes. MARGOT is first tested on non-linearly separable synthetic datasets in a 2-dimensional feature space to provide a graphical representation of the maximum-margin approach; the proposed models are then evaluated on benchmark datasets from the UCI repository. The MARGOT formulation turns out to be easier to solve than other OCT approaches, and the generated tree generalizes better on new observations. The two interpretable versions are effective in selecting the most relevant features while maintaining good prediction quality.
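MARGOT optimizes all hyperplanes jointly in a single mixed-integer quadratic program; purely as a conceptual illustration of "maximum-margin multivariate hyperplanes nested in a binary tree structure", the sketch below builds a shallow tree greedily, fitting one linear SVM per internal node (scikit-learn assumed). This greedy recursion is not the MARGOT formulation and carries none of its optimality guarantees.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_margin_tree(X, y, depth=2, C=1.0):
    """Greedy illustration: one max-margin hyperplane per internal node."""
    if depth == 0 or len(np.unique(y)) < 2:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    svm = LinearSVC(C=C, max_iter=10000).fit(X, y)
    side = svm.decision_function(X) >= 0
    if side.all() or (~side).all():          # hyperplane failed to split the node
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    return {"leaf": False, "svm": svm,
            "left": fit_margin_tree(X[~side], y[~side], depth - 1, C),
            "right": fit_margin_tree(X[side], y[side], depth - 1, C)}

def predict_one(node, x):
    """Route a single sample down the nested hyperplanes to a leaf label."""
    while not node["leaf"]:
        node = node["right"] if node["svm"].decision_function(x[None])[0] >= 0 else node["left"]
    return node["label"]
```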
Methods for learning optimal policies use causal machine learning models to create human-interpretable rules for making choices around the allocation of different policy interventions. However, in realistic policy-making contexts, decision-makers often care about trade-offs between outcomes, not just single-mindedly maximising utility for one outcome. This paper proposes an approach termed Multi-Objective Policy Learning (MOPoL) which combines optimal decision trees for policy learning with a multi-objective Bayesian optimisation approach to explore the trade-off between multiple outcomes. It does this by building a Pareto frontier of non-dominated models for different hyperparameter settings. The key insight is that a low-cost surrogate function can be an accurate proxy, in terms of expected regret, for the very computationally costly optimal tree. This surrogate can be fit many times with different hyperparameter values to proxy the performance of the optimal model. The method is applied to a real-world case study of conditional cash transfers in Morocco, where hybrid (partially optimal, partially greedy) policy trees provide good performance as a surrogate for optimal trees while being computationally cheap enough to feasibly fit a Pareto frontier.
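A small sketch of the non-dominated filtering step that underlies the Pareto frontier construction (the regret values are hypothetical, and the Bayesian-optimisation loop and the surrogate policy trees themselves are not shown): given an estimated regret per outcome for each candidate hyperparameter setting, keep only the settings not dominated on both objectives.

```python
import numpy as np

def pareto_front(costs):
    """Indices of non-dominated rows (lower is better in every column)."""
    costs = np.asarray(costs, dtype=float)
    keep = []
    for i, c in enumerate(costs):
        dominated = np.any(np.all(costs <= c, axis=1) & np.any(costs < c, axis=1))
        if not dominated:
            keep.append(i)
    return keep

# Hypothetical (regret on outcome A, regret on outcome B) for candidate
# hyperparameter settings of the surrogate policy tree.
candidates = [(0.10, 0.40), (0.15, 0.20), (0.30, 0.15), (0.25, 0.35)]
print(pareto_front(candidates))   # -> [0, 1, 2]; (0.25, 0.35) is dominated by (0.15, 0.20)
```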
Weighted round robin (WRR) is an effective, yet particularly easy-to-implement packet scheduler. A slight modification in the implementation of WRR, interleaved weighted round robin, has been proposed as an enhancement of the initial version and has recently been investigated. Network calculus is a versatile framework to model and analyze such network schedulers; with it, one can derive theoretical upper bounds on network performance metrics such as delay or backlog. In our previous work, we derive performance bounds by showing that both round-robin variants belong to a class called bandwidth-sharing policies; however, the proofs are incomplete and thus we cannot conclude that the round-robin schedulers are bandwidth-sharing policies (under variable packet sizes). To address this, in the subsequent erratum, we introduce so-called resource-segregating policies and show the round-robin schedulers to be members of this class. We first present our original work, as published in [CNS22-1], and then the erratum correcting the previously mentioned shortcoming. In our erratum, we provide slightly worse delay bounds compared to [CNS22-1]; yet, across all our experiments, they significantly outperform the state of the art.
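For readers unfamiliar with the two schedulers, a minimal sketch (ignoring variable packet sizes, which is exactly where the network-calculus analysis becomes delicate) of the service orders produced by plain WRR and its interleaved variant for integer weights:

```python
def wrr_order(weights):
    """Plain WRR: serve each flow its full weight in one burst per round."""
    return [f for f, w in weights.items() for _ in range(w)]

def iwrr_order(weights):
    """Interleaved WRR: spread each flow's service opportunities across the round."""
    max_w = max(weights.values())
    return [f for turn in range(max_w) for f, w in weights.items() if turn < w]

weights = {"A": 3, "B": 2, "C": 1}
print(wrr_order(weights))    # ['A', 'A', 'A', 'B', 'B', 'C']
print(iwrr_order(weights))   # ['A', 'B', 'C', 'A', 'B', 'A']
```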
Transfer learning aims to improve the performance of a target model by leveraging data from related source populations, which is known to be especially helpful in cases with insufficient target data. In this paper, we study the problem of how to train a high-dimensional ridge regression model using limited target data and existing regression models trained in heterogeneous source populations. We consider a practical setting where only the parameter estimates of the fitted source models are accessible, instead of the individual-level source data. Under the setting with only one source model, we propose a novel flexible angle-based transfer learning (angleTL) method, which leverages the concordance between the source and the target model parameters. We show that angleTL unifies several benchmark methods by construction, including the target-only model trained using target data alone, the source model fitted on source data, and a distance-based transfer learning method that incorporates the source parameter estimates and the target data under a distance-based similarity constraint. We also provide algorithms to effectively incorporate multiple source models, accounting for the fact that some source models may be more helpful than others. Our high-dimensional asymptotic analysis provides interpretations and insights regarding when a source model can be helpful to the target model, and demonstrates the superiority of angleTL over other benchmark methods. We perform extensive simulation studies to validate our theoretical conclusions and show the feasibility of applying angleTL to transfer existing genetic risk prediction models across multiple biobanks.
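As a rough illustration of the ingredients named above (not the angleTL estimator itself), the sketch below shows the distance-based baseline, a ridge fit shrunk toward the source parameter estimate, together with a simple angle-style variant that first rescales the source estimate so that only its direction, not its magnitude, matters; the rescaling rule here is an assumption made for illustration only.

```python
import numpy as np

def ridge_toward(X, y, w, lam):
    """argmin_b ||y - Xb||^2 + lam * ||b - w||^2  (shrink the fit toward w)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y + lam * w)

def angle_style_transfer(X, y, w_src, lam):
    """Illustration only: rescale the source estimate so that only its
    direction matters, then shrink toward the rescaled vector."""
    b0 = ridge_toward(X, y, np.zeros(X.shape[1]), lam)      # target-only ridge
    eta = float(b0 @ w_src) / float(w_src @ w_src)          # best scale for w_src
    return ridge_toward(X, y, eta * w_src, lam)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))                     # limited, high-dimensional target data
beta = np.zeros(200); beta[:5] = 1.0
y = X @ beta + rng.normal(size=50)
w_src = 3.0 * beta + 0.1 * rng.normal(size=200)    # right direction, wrong scale
b_hat = angle_style_transfer(X, y, w_src, lam=5.0)
```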
We study the sparse high-dimensional Gaussian mixture model when the number of clusters is allowed to grow with the sample size. A minimax lower bound for parameter estimation is established, and we show that a constrained maximum likelihood estimator achieves the minimax lower bound. However, this optimization-based estimator is computationally intractable because the objective function is highly nonconvex and the feasible set involves discrete structures. To address the computational challenge, we propose a Bayesian approach to estimate high-dimensional Gaussian mixtures whose cluster centers exhibit sparsity using a continuous spike-and-slab prior. Posterior inference can be efficiently computed using an easy-to-implement Gibbs sampler. We further prove that the posterior contraction rate of the proposed Bayesian method is minimax optimal. The mis-clustering rate is obtained as a by-product using tools from matrix perturbation theory. The proposed Bayesian sparse Gaussian mixture model does not require pre-specifying the number of clusters, which can be adaptively estimated via the Gibbs sampler. The validity and usefulness of the proposed method are demonstrated through simulation studies and the analysis of a real-world single-cell RNA sequencing dataset.
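Not the paper's full Gibbs sampler, but a sketch of the conjugate computation at the heart of a continuous spike-and-slab update: for one coordinate of a cluster center, given the observations currently assigned to that cluster, the posterior probability that the coordinate comes from the wide "slab" normal rather than the narrow "spike" normal follows from two normal-normal marginal likelihoods (all hyperparameters below are illustrative).

```python
import numpy as np

def log_marginal(x, sigma2, tau2):
    """log ∫ Π N(x_i; mu, sigma2) N(mu; 0, tau2) dmu  (normal-normal conjugacy)."""
    n = len(x)
    A = n / sigma2 + 1.0 / tau2
    B = x.sum() / sigma2
    return (-0.5 * n * np.log(2 * np.pi * sigma2) - (x ** 2).sum() / (2 * sigma2)
            - 0.5 * np.log(tau2 * A) + B ** 2 / (2 * A))

def slab_probability(x, sigma2, tau_spike2, tau_slab2, q):
    """P(coordinate belongs to the wide 'slab' | data) under a continuous spike-and-slab prior."""
    l1 = np.log(q) + log_marginal(x, sigma2, tau_slab2)
    l0 = np.log(1 - q) + log_marginal(x, sigma2, tau_spike2)
    return 1.0 / (1.0 + np.exp(l0 - l1))

rng = np.random.default_rng(0)
x_signal = rng.normal(2.0, 1.0, size=30)   # coordinate with a real cluster-center shift
x_noise = rng.normal(0.0, 1.0, size=30)    # inactive coordinate
print(slab_probability(x_signal, 1.0, 1e-3, 10.0, q=0.1))  # close to 1
print(slab_probability(x_noise, 1.0, 1e-3, 10.0, q=0.1))   # close to 0
```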
As causal inference becomes more widespread, the importance of having good tools to test for causal effects increases. In this work we focus on the problem of testing for causal effects that manifest in a difference in distribution between treatment and control. We build on work applying kernel methods to causality, considering the previously introduced Counterfactual Mean Embedding (CfME) framework. We improve on this by proposing the Doubly Robust Counterfactual Mean Embedding (DR-CfME), which has better theoretical properties than its predecessor by leveraging semiparametric theory. This leads us to propose new kernel-based test statistics for distributional effects which are based upon doubly robust estimators of treatment effects. We propose two test statistics: one which is a direct improvement on previous work, and one which can be applied even when the support of the treatment arm is a subset of that of the control arm. We demonstrate the validity of our methods on simulated and real-world data, as well as giving an application in off-policy evaluation.
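The paper lifts doubly robust estimation to kernel mean embeddings; as the scalar analogue of that construction, the standard augmented-IPW estimator below combines an outcome model and a propensity model and remains consistent if either one is correctly specified (simulated data for illustration):

```python
import numpy as np

def aipw_ate(y, t, e_hat, m1_hat, m0_hat):
    """Augmented IPW (doubly robust) estimate of E[Y(1) - Y(0)].

    y: outcomes, t: binary treatment, e_hat: estimated propensity P(T=1|X),
    m1_hat / m0_hat: estimated outcome regressions E[Y|X, T=1] / E[Y|X, T=0].
    """
    term1 = m1_hat + t * (y - m1_hat) / e_hat
    term0 = m0_hat + (1 - t) * (y - m0_hat) / (1 - e_hat)
    return float(np.mean(term1 - term0))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
e = 1 / (1 + np.exp(-x))                      # true propensity
t = rng.binomial(1, e)
y = 2 * t + x + rng.normal(size=500)          # true effect = 2
print(aipw_ate(y, t, e, m1_hat=x + 2, m0_hat=x))
```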
This article develops a convex description of a classical or quantum learner's or agent's state of knowledge about its environment, presented as a convex subset of a commutative R-algebra. With caveats, this leads to a generalization of certain semidefinite programs in quantum information (such as those describing the universal query algorithm dual to the quantum adversary bound, related to optimal learning or control of the environment) to the classical and faulty-quantum setting, which would not be possible with a naive description via joint probability distributions over environment and internal memory. More philosophically, it also makes an interpretation of the set of reduced density matrices as "states of knowledge" of an observer of its environment, related to these techniques, more explicit. As another example, I describe and solve a formal differential equation of states of knowledge in that algebra, where an agent obtains experimental data in a Poissonian process, and its state of knowledge evolves as an exponential power series. However, this framework currently lacks impressive applications, and I post it in part to solicit feedback and collaboration on those. In particular, it may be possible to develop it into a new framework for the design of experiments, e.g. the problem of finding maximally informative questions to ask human labelers or the environment in machine-learning problems. The parts of the article not related to quantum information don't assume knowledge of it.
Motion planning and control in autonomous car racing are among the most challenging and safety-critical tasks due to the high speeds and dynamism involved. The lower-level control nodes must be highly optimized because of the resource constraints of onboard embedded processing units, while still meeting strict latency requirements. Some of these guarantees can be provided at the application level, for example by using ROS2's Real-Time executors. However, the performance can be far from satisfactory, as many modern control algorithms (such as Model Predictive Control) rely on solving complicated online optimization problems at each iteration. In this paper, we present a simple yet effective multi-threading technique to optimize the throughput of online-control algorithms for resource-constrained autonomous racing platforms. We achieve this by maintaining a pool of worker threads that solve the optimization problem in parallel, which improves system performance by reducing the latency between control input commands. We further demonstrate the effectiveness of our method using the Model Predictive Contouring Control (MPCC) algorithm running on Nvidia's Xavier AGX platform.
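A language-agnostic sketch of the thread-pool idea (written in Python for brevity; an actual deployment would live in the C++/ROS2 control node, and the dummy solver below merely stands in for an MPCC solve): keep several solves in flight and always publish the most recently completed solution instead of blocking on the current one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def solve_mpc(state):
    """Stand-in for an expensive online optimization (e.g. one MPCC solve)."""
    time.sleep(0.05)                      # pretend the solver takes 50 ms
    return [0.1 * s for s in state]       # dummy control inputs

class ControlPool:
    """Keep several solves in flight so a fresh control input is ready sooner."""
    def __init__(self, n_workers=4):
        self.pool = ThreadPoolExecutor(max_workers=n_workers)
        self.pending = []

    def submit(self, state):
        self.pending.append(self.pool.submit(solve_mpc, state))

    def latest_ready(self):
        done = [f for f in self.pending if f.done()]
        if not done:
            return None
        self.pending = [f for f in self.pending if f not in done]
        return done[-1].result()          # most recently completed solution

pool = ControlPool()
for step in range(10):
    pool.submit([step, step * 0.5])       # new measurement arrives
    u = pool.latest_ready()               # publish whatever is ready, don't block
    time.sleep(0.02)                      # 50 Hz control loop
```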
Deployment of Internet of Things (IoT) devices and Data Fusion techniques have gained popularity in public and government domains. This usually requires capturing and consolidating data from multiple sources. Because datasets do not necessarily originate from identical sensors, fused data typically results in a complex data problem. Since the military is investigating how heterogeneous IoT devices can aid its processes and tasks, we investigate a multi-sensor approach. Moreover, we propose a signal-to-image encoding approach that transforms (fuses) signals from IoT wearable devices into an image which is invertible and easier to visualize, supporting decision making. Furthermore, we investigate the challenge of enabling intelligent identification and detection, and demonstrate the feasibility of the proposed Deep Learning and Anomaly Detection models, which can support future applications that utilize hand gesture data from wearable devices.
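The abstract does not spell out the encoding, so the sketch below shows one simple invertible choice consistent with the description: min-max scale each sensor channel, quantize to 8 bits, and stack the channels as image rows, keeping the per-channel scaling so the image can be decoded back to the original signals (up to quantization error).

```python
import numpy as np

def encode(signals):
    """signals: (channels, samples). Returns an 8-bit image plus the scaling
    needed to invert the encoding."""
    mins = signals.min(axis=1, keepdims=True)
    maxs = signals.max(axis=1, keepdims=True)
    scaled = (signals - mins) / (maxs - mins)
    image = np.round(scaled * 255).astype(np.uint8)     # channels become image rows
    return image, (mins, maxs)

def decode(image, scale):
    """Invert the encoding back to the original signal scale."""
    mins, maxs = scale
    return image.astype(float) / 255 * (maxs - mins) + mins

sig = np.random.default_rng(0).normal(size=(6, 128))     # 6 wearable-sensor channels
img, scale = encode(sig)
rec = decode(img, scale)
print(img.shape, np.abs(rec - sig).max())                # small quantization error only
```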