Black-box machine learning models are criticized for lacking interpretability, although they tend to have good prediction accuracy. Knowledge Distillation (KD) is an emerging tool for interpreting a black-box model by distilling its knowledge into a transparent model. With well-known advantages in interpretation, the decision tree is a competitive candidate for the transparent model. However, theoretical and empirical understanding of the decision tree generated from the KD process is limited. In this paper, we name this kind of decision tree the distillation decision tree (DDT) and lay the theoretical foundations for tree structure stability, which determines the validity of the DDT's interpretation. We prove that the structure of the DDT can achieve stability (convergence) under some mild assumptions. Meanwhile, we develop algorithms for stabilizing the induction of the DDT, propose parallel strategies for improving the algorithms' computational efficiency, and introduce a marginal principal component analysis method for overcoming the curse of dimensionality in sampling. Simulated and real data studies justify our theoretical results, validate the efficacy of the algorithms, and demonstrate that the DDT can strike a good balance between the model's prediction accuracy and interpretability.
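As a rough illustration of the distillation step described above (not the paper's DDT induction algorithm), the following sketch labels pseudo-samples with a black-box model and fits a shallow CART tree to them; the black box, sampling distribution, sample sizes, and tree depth are placeholder choices.

```python
# Minimal sketch of distilling a black-box model into a decision tree.
# The black-box, the sampling scheme, and all hyperparameters are
# illustrative placeholders, not the paper's DDT procedure.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Draw pseudo-samples from (an approximation of) the covariate distribution
# and label them with the black-box predictions, i.e. the knowledge to distill.
X_pseudo = rng.normal(size=(20000, 5))
y_pseudo = black_box.predict(X_pseudo)

# The transparent student: a shallow decision tree fit to the pseudo-labels.
ddt = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_pseudo, y_pseudo)
print("fidelity to black-box:", ddt.score(X_pseudo, y_pseudo))
```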
The sorting operation is one of the main bottlenecks of successive-cancellation list (SCL) decoding. This paper introduces an improvement to SCL decoding for polar and pre-transformed polar codes that reduces the number of sorting operations without degrading the code's error-correction performance. We show that, in SCL decoding with an optimal metric function, the correct branch's bit-metric value must, on average, equal the bit-channel capacity, whereas the average bit-metric value of a wrong branch can be at most zero. This implies that a wrong path's partial path metric value deviates from the partial sum of the bit-channel capacities. For relatively reliable bit-channels, the bit metric of a wrong branch becomes a very large negative number, which enables us to detect and prune such paths. We prove that, for a threshold lower than the bit-channel cutoff rate, the probability of pruning the correct path decreases exponentially in the given threshold. Based on these findings, we present a pruning technique, and the experimental results demonstrate a substantial decrease in the number of sorting operations required for SCL decoding. In the stack algorithm, a similar technique is used to significantly reduce the average number of paths in the stack.
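The pruning rule can be sketched as follows, under the assumption that a path is dropped whenever its partial path metric falls more than a threshold below the partial sum of bit-channel capacities; the paper's exact metric function and threshold selection may differ.

```python
# Illustrative sketch of the path-pruning idea: compare a path's partial
# path metric with the partial sum of bit-channel capacities and drop
# paths that fall more than a threshold below it. The exact metric and
# threshold choice in the paper may differ; this shows only the general rule.
from typing import List

def prune_paths(partial_metrics: List[float],
                capacity_prefix_sum: float,
                threshold: float) -> List[int]:
    """Return indices of surviving paths at the current decoding stage."""
    survivors = []
    for idx, metric in enumerate(partial_metrics):
        # A correct path stays close to the capacity prefix sum on average;
        # a wrong path drifts below it, so a large gap flags it for pruning.
        if capacity_prefix_sum - metric <= threshold:
            survivors.append(idx)
    return survivors

# Example: four candidate paths, bit-channel capacities summing to 5.2 so far.
print(prune_paths([5.0, 4.8, 1.3, -2.0], capacity_prefix_sum=5.2, threshold=2.0))
# -> [0, 1]; the last two paths deviate too much and are pruned before sorting.
```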
Adopting a good health information system (HIS) is essential for providing high-quality healthcare. With rapid technological advances in the healthcare industry in recent years, healthcare providers seek effective options to deal with numerous diseases and a growing number of patients, adopting advanced HIS such as clinical decision support systems. While clinical decision support systems (CDSS) can help medical personnel make better decisions, they may bring negative results due to a lack of understanding of the elements that influence GPs' adoption of CDSS. This paper focuses on discovering obstacles that may contribute to the problems surrounding CDSS adoption. Thirty general practitioners from different primary health centers in Saudi Arabia were interviewed in order to determine the challenges and obstacles in the sector. The outcome confirms that adoption is affected by obstacles such as time risk, quality of the system used, slow Internet speed, user interface, lack of training, high costs, patient satisfaction, multiple systems in use, technical support, computer skills, lack of flexibility, system updates, professional skills and knowledge, computer efficiency, and the quality and accuracy of data.
Most existing methods for few-shot object detection follow the fine-tuning paradigm, which implicitly assumes that class-agnostic generalizable knowledge can be learned and transferred from base classes with abundant samples to novel classes with limited samples via such a two-stage training strategy. However, this is not necessarily true, since the object detector can hardly distinguish between class-agnostic knowledge and class-specific knowledge automatically without explicit modeling. In this work, we propose to explicitly learn three types of class-agnostic commonalities between base and novel classes: recognition-related semantic commonalities, localization-related semantic commonalities, and distribution commonalities. We design a unified distillation framework based on a memory bank, which is able to distill all three types of commonalities jointly and efficiently. Extensive experiments demonstrate that our method can be readily integrated into most existing fine-tuning-based methods and consistently improves their performance by a large margin.
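The following is a highly simplified sketch of one possible memory-bank ingredient, assuming the bank stores per-class prototype features updated from the teacher and the student's features are pulled toward the prototype of their class; the actual framework distills all three commonalities jointly and differs in its details.

```python
# Simplified sketch: a per-class memory bank of prototype features and a
# distillation loss pulling student features toward the prototype of their
# class. Names, the momentum value, and the loss form are assumptions for
# illustration; the paper's framework distills three commonalities jointly.
import torch
import torch.nn.functional as F

class PrototypeMemoryBank:
    def __init__(self, num_classes: int, feat_dim: int, momentum: float = 0.99):
        self.prototypes = torch.zeros(num_classes, feat_dim)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, teacher_feats: torch.Tensor, labels: torch.Tensor):
        # Exponential moving average of teacher features per class.
        for c in labels.unique():
            mean_feat = teacher_feats[labels == c].mean(dim=0)
            self.prototypes[c] = (self.momentum * self.prototypes[c]
                                  + (1 - self.momentum) * mean_feat)

    def distill_loss(self, student_feats: torch.Tensor, labels: torch.Tensor):
        # Cosine distance between student features and their class prototypes.
        protos = F.normalize(self.prototypes[labels], dim=1)
        feats = F.normalize(student_feats, dim=1)
        return (1 - (protos * feats).sum(dim=1)).mean()
```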
Causal effect estimation from observational data is a challenging problem, especially with high-dimensional data and in the presence of unobserved variables. The available data-driven methods for tackling the problem either provide only an estimation of the bounds of a causal effect (i.e., non-unique estimation) or have low efficiency. The major hurdle to achieving high efficiency while obtaining a unique and unbiased causal effect estimate is finding a proper adjustment set for confounding control quickly, given the huge covariate space and the presence of unobserved variables. In this paper, we approach the problem as a local search task for finding valid adjustment sets in data. We establish theorems to support the local search for adjustment sets, and we show that unique and unbiased estimation can be achieved from observational data even when there exist unobserved variables. We then propose a data-driven algorithm that is fast and consistent under mild assumptions. We also make use of a frequent pattern mining method to further speed up the search for minimal adjustment sets for causal effect estimation. Experiments conducted on extensive synthetic and real-world datasets demonstrate that the proposed algorithm outperforms the state-of-the-art criteria/estimators in both accuracy and time efficiency.
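For context, the sketch below shows the standard adjustment step once a valid adjustment set Z is available (regression adjustment on synthetic data); the paper's contributions, the local search and the pattern-mining speed-up for finding Z, are not reproduced here.

```python
# Minimal sketch of regression adjustment given a valid adjustment set Z.
# Finding Z via local search / pattern mining is the paper's contribution
# and is not reproduced here; data and coefficients are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 5000
Z = rng.normal(size=(n, 3))                                   # observed confounders
T = (Z[:, 0] + rng.normal(size=n) > 0).astype(float)          # binary treatment
Y = 2.0 * T + Z @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)

# Naive difference in means is confounded by Z.
naive = Y[T == 1].mean() - Y[T == 0].mean()

# Adjusting for Z recovers the true effect (2.0) up to noise.
model = LinearRegression().fit(np.column_stack([T, Z]), Y)
print(f"naive: {naive:.2f}, adjusted: {model.coef_[0]:.2f}")
```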
Decision-guided perspectives on model uncertainty expand traditional statistical thinking about managing, comparing and combining inferences from sets of models. Bayesian predictive decision synthesis (BPDS) advances conceptual and theoretical foundations, and defines new methodology that explicitly integrates decision-analytic outcomes into the evaluation, comparison and potential combination of candidate models. BPDS extends recent theoretical and practical advances based on both Bayesian predictive synthesis and empirical goal-focused model uncertainty analysis. This is enabled by the development of a novel subjective Bayesian perspective on model weighting in predictive decision settings. Illustrations come from applied contexts, including optimal design for regression prediction and sequential time series forecasting for financial portfolio decisions.
As progress in AI continues to advance, it is crucial to know how advanced systems will make choices and in what ways they may fail. Machines can already outsmart humans in some domains, and understanding how to safely build ones whose capabilities are at or above the human level is of particular concern. One might suspect that artificially generally intelligent (AGI) and artificially superintelligent (ASI) systems should be modeled as something which humans, by definition, cannot reliably outsmart. As a challenge to this assumption, this paper presents the Achilles Heel hypothesis, which states that even a potentially superintelligent system may nonetheless have stable decision-theoretic delusions that cause it to make obviously irrational decisions in adversarial settings. In a survey of relevant dilemmas and paradoxes from the decision theory literature, a number of these potential Achilles Heels are discussed in the context of this hypothesis. Several novel contributions are made toward understanding the ways in which these weaknesses might be implanted into a system.
Model Predictive Control (MPC) approaches are widely used in robotics, since they allow updated trajectories to be computed while the robot is moving. They generally require heuristic references for the tracking terms and proper tuning of the cost function parameters in order to obtain good performance. When, for example, a legged robot has to react to disturbances from the environment (e.g., recover after a push) or track a certain goal with statically unstable gaits, the effectiveness of the algorithm can degrade. In this work, we propose a novel optimization-based Reference Generator, named Governor, which exploits a Linear Inverted Pendulum model to compute reference trajectories for the Center of Mass while taking into account the possible under-actuation of a gait (e.g., in a trot). The obtained trajectories are used as references for the cost function of the Nonlinear MPC presented in our previous work [1]. We also present a formulation that can guarantee a certain response time to reach a goal, without the need to tune the weights of the cost terms. In addition, foothold locations are corrected to drive the robot towards the goal. We demonstrate the effectiveness of our approach both in simulations and in experiments in different scenarios with the Aliengo robot.
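For intuition, the sketch below integrates the Linear Inverted Pendulum dynamics $\ddot{x} = (g/h)(x - p)$ with a simple hand-written policy for the virtual ZMP point $p$ so that the CoM reference approaches a goal; the actual Governor obtains its references by solving an optimization problem, and the gains and step size used here are arbitrary.

```python
# Sketch of generating a CoM reference with Linear Inverted Pendulum (LIP)
# dynamics: x_ddot = (g / h) * (x - p), where p is the virtual ZMP point.
# The gains, step size, and ZMP policy are assumptions; the actual
# reference generator solves an optimization problem.
import numpy as np

g, h, dt = 9.81, 0.5, 0.01          # gravity, CoM height, integration step
omega2 = g / h

def lip_reference(x0, v0, goal, steps=300, kp=2.0, kd=1.5):
    """Integrate LIP dynamics with a simple PD-like ZMP policy toward a goal."""
    x, v, traj = x0, v0, []
    for _ in range(steps):
        # Place the virtual ZMP so the resulting acceleration pushes
        # the CoM toward the goal while damping its velocity.
        p = x + (kp * (x - goal) + kd * v) / omega2
        a = omega2 * (x - p)
        v += a * dt
        x += v * dt
        traj.append(x)
    return np.array(traj)

print(lip_reference(x0=0.0, v0=0.0, goal=1.0)[-1])  # CoM approaches the goal
```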
Formal XAI (explainable AI) is a growing area that focuses on computing explanations with mathematical guarantees for the decisions made by ML models. Within formal XAI, one of the most studied cases is that of explaining the choices taken by decision trees, as they are traditionally deemed one of the most interpretable classes of models. Recent work has focused on studying the computation of "sufficient reasons", a kind of explanation in which, given a decision tree $T$ and an instance $x$, one explains the decision $T(x)$ by providing a subset $y$ of the features of $x$ such that for any other instance $z$ compatible with $y$, it holds that $T(z) = T(x)$, intuitively meaning that the features in $y$ are already enough to fully justify the classification of $x$ by $T$. It has been argued, however, that sufficient reasons constitute a restrictive notion of explanation, and thus the community has started to study their probabilistic counterpart, in which one requires that the probability of $T(z) = T(x)$ must be at least some value $\delta \in (0, 1]$, where $z$ is a random instance that is compatible with $y$. Our paper settles the computational complexity of $\delta$-sufficient reasons over decision trees, showing that both (1) finding $\delta$-sufficient reasons that are minimal in size and (2) finding $\delta$-sufficient reasons that are minimal inclusion-wise do not admit polynomial-time algorithms (unless P = NP). This is in stark contrast to the deterministic case ($\delta = 1$), where inclusion-wise minimal sufficient reasons are easy to compute. By doing this, we answer two open problems originally raised by Izza et al. On the positive side, we identify structural restrictions of decision trees that make the problem tractable, and show how SAT solvers might be able to tackle these problems in practical settings.
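The probabilistic notion can be checked by brute force on small trees, as in the toy sketch below: fix the features in the candidate subset, enumerate all completions uniformly, and test whether the agreement probability reaches $\delta$. The tree used here is an illustrative example, not one from the paper, and the enumeration is exponential, which is exactly why the complexity question matters.

```python
# Brute-force check of a delta-sufficient reason over binary features:
# fix the features in the subset S to their values in x, enumerate all
# completions z uniformly, and test whether Pr[T(z) = T(x)] >= delta.
# The tree T below is a toy example, not from the paper.
from itertools import product

def T(z):  # a small decision tree over 3 binary features
    if z[0] == 1:
        return 1 if z[1] == 1 else 0
    return 1 if z[2] == 1 else 0

def is_delta_sufficient_reason(x, S, delta, n_features=3):
    target, agree, total = T(x), 0, 0
    for bits in product([0, 1], repeat=n_features):
        z = [x[i] if i in S else bits[i] for i in range(n_features)]
        agree += (T(z) == target)
        total += 1
    return agree / total >= delta

x = (1, 1, 0)                                               # T(x) = 1
print(is_delta_sufficient_reason(x, S={0, 1}, delta=1.0))   # True: a sufficient reason
print(is_delta_sufficient_reason(x, S={1}, delta=0.75))     # True: agreement is exactly 0.75
```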
Knowledge Distillation (KD) is a widely used technique for transferring information from cumbersome teacher models to compact student models, thereby realizing model compression and acceleration. Compared with image classification, object detection is a more complex task, and designing specific KD methods for object detection is non-trivial. In this work, we carefully study the behaviour difference between teacher and student detection models and obtain two intriguing observations. First, the teacher and student rank their detected candidate boxes quite differently, which results in their precision discrepancy. Second, there is a considerable gap between the feature response differences and the prediction differences of teacher and student, indicating that equally imitating all the feature maps of the teacher is a sub-optimal choice for improving the student's accuracy. Based on these two observations, we propose Rank Mimicking (RM) and Prediction-guided Feature Imitation (PFI) for distilling one-stage detectors. RM takes the rank of candidate boxes from the teacher as a new form of knowledge to distill, which consistently outperforms traditional soft-label distillation. PFI attempts to correlate feature differences with prediction differences, making feature imitation directly help to improve the student's accuracy. On the MS COCO and PASCAL VOC benchmarks, extensive experiments are conducted on various detectors with different backbones to validate the effectiveness of our method. Specifically, RetinaNet with ResNet50 achieves 40.4% mAP on MS COCO, which is 3.5% higher than its baseline and also outperforms previous KD methods.
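One possible reading of Rank Mimicking (the paper's exact formulation may differ) is to align the student's score distribution over a set of candidate boxes with the teacher's, for instance via a KL divergence on temperature-scaled softmax scores, as sketched below.

```python
# Illustrative reading of Rank Mimicking: align the student's distribution
# over candidate-box scores with the teacher's via a KL divergence on
# softmax-normalized scores. The temperature and the exact loss form are
# assumptions; the paper's formulation may differ.
import torch
import torch.nn.functional as F

def rank_mimicking_loss(student_scores: torch.Tensor,
                        teacher_scores: torch.Tensor,
                        tau: float = 1.0) -> torch.Tensor:
    """Scores have shape (num_boxes,) and belong to the same set of
    candidate boxes associated with one ground-truth object."""
    p_teacher = F.softmax(teacher_scores / tau, dim=0)
    log_p_student = F.log_softmax(student_scores / tau, dim=0)
    return F.kl_div(log_p_student, p_teacher, reduction="sum")

t = torch.tensor([4.0, 2.5, 0.3, -1.0])   # teacher ranks box 0 highest
s = torch.tensor([1.0, 3.0, 0.2, -0.5])   # student disagrees on the top box
print(rank_mimicking_loss(s, t))          # non-zero loss penalizes the rank gap
```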
This paper focuses on the expected difference in a borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence the estimation error can be substantial. We therefore propose an alternative approach to constructing estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of the classical and the proposed estimators in estimating the causal quantities. The comparison is conducted across a wide range of models, including linear regression models, tree-based models, and neural-network-based models, under simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approach to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is strikingly substantial when the causal effects are accounted for correctly.
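As a toy illustration of why ignoring confounding inflates the error, the sketch below contrasts a naive difference-in-means estimator with an inverse-propensity-weighted estimator on synthetic lending data; the data-generating process and the IPW estimator are illustrative assumptions, not the estimators proposed in the paper.

```python
# Toy contrast between a naive estimator and an inverse-propensity-weighted
# (IPW) estimator when the credit decision depends on a confounder. The
# data-generating process and the IPW choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 20000
credit_score = rng.normal(size=n)                                 # confounder
approve = (credit_score + rng.normal(size=n) > 0).astype(float)   # lender decision
repay = 0.5 * approve + 1.0 * credit_score + rng.normal(size=n)   # true effect 0.5

# Naive difference in means mixes the decision effect with the confounder.
naive = repay[approve == 1].mean() - repay[approve == 0].mean()

# Reweight by estimated propensities to remove the confounding.
ps = LogisticRegression().fit(credit_score[:, None], approve).predict_proba(
    credit_score[:, None])[:, 1]
ipw = np.mean(approve * repay / ps) - np.mean((1 - approve) * repay / (1 - ps))
print(f"naive: {naive:.2f}, IPW: {ipw:.2f} (truth: 0.50)")
```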