Decision tree learning is a widely used approach in machine learning, favoured in applications that require concise and interpretable models. Heuristic methods are traditionally used to quickly produce models with reasonably high accuracy. A commonly criticised point, however, is that the resulting trees may not necessarily be the best representation of the data in terms of accuracy and size. In recent years, this motivated the development of optimal classification tree algorithms that globally optimise the decision tree in contrast to heuristic methods that perform a sequence of locally optimal decisions. We follow this line of work and provide a novel algorithm for learning optimal classification trees based on dynamic programming and search. Our algorithm supports constraints on the depth of the tree and number of nodes. The success of our approach is attributed to a series of specialised techniques that exploit properties unique to classification trees. Whereas algorithms for optimal classification trees have traditionally been plagued by high runtimes and limited scalability, we show in a detailed experimental study that our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances, providing several orders of magnitude improvements and notably contributing towards the practical realisation of optimal decision trees.
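To make the contrast between greedy and globally optimal trees concrete, here is a minimal sketch of exhaustive depth-limited search with memoisation over binary features. It is not the algorithm proposed in the abstract above, only a toy illustration of the dynamic-programming idea; the names `leaf_cost` and `optimal_tree_cost` and the XOR data are assumptions made for this example.

```python
from functools import lru_cache


def leaf_cost(labels):
    """Misclassifications when a leaf predicts its majority class (0/1 labels)."""
    if not labels:
        return 0
    ones = sum(labels)
    return min(ones, len(labels) - ones)


def optimal_tree_cost(X, y, max_depth):
    """Minimum training misclassifications over all trees of depth <= max_depth.

    X is a list of rows of 0/1 features, y a list of 0/1 labels.
    """
    n_features = len(X[0]) if X else 0

    @lru_cache(maxsize=None)
    def solve(rows, depth):
        # rows: tuple of row indices reaching this node (the subproblem key)
        best = leaf_cost([y[i] for i in rows])
        if depth == 0 or best == 0:
            return best
        for f in range(n_features):
            left = tuple(i for i in rows if X[i][f] == 0)
            right = tuple(i for i in rows if X[i][f] == 1)
            if not left or not right:
                continue  # this feature does not separate the rows
            best = min(best, solve(left, depth - 1) + solve(right, depth - 1))
        return best

    return solve(tuple(range(len(X))), max_depth)


# Toy XOR data (illustrative only)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
cost_depth1 = optimal_tree_cost(X, y, 1)
cost_depth2 = optimal_tree_cost(X, y, 2)
```

On this XOR data no single split reduces the error, so a purely local criterion gets no signal at the root, while the exhaustive search finds a depth-2 tree with zero training error. The cost of this naive version grows quickly with the number of features and instances, which is exactly the scalability problem such work has to address.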
Consider any locally checkable labeling problem $\Pi$ in rooted regular trees: there is a finite set of labels $\Sigma$, and for each label $x \in \Sigma$ we specify which label combinations of the children are permitted for an internal node with label $x$ (the leaf nodes are unconstrained). This formalism is expressive enough to capture many classic problems studied in distributed computing, including vertex coloring, edge coloring, and maximal independent set. We show that the distributed computational complexity of any such problem $\Pi$ falls in one of the following classes: it is $O(1)$, $\Theta(\log^* n)$, $\Theta(\log n)$, or $n^{\Theta(1)}$ rounds in trees with $n$ nodes (and all of these classes are nonempty). We show that the complexity of any given problem is the same in all four standard models of distributed graph algorithms: deterministic $\mathsf{LOCAL}$, randomized $\mathsf{LOCAL}$, deterministic $\mathsf{CONGEST}$, and randomized $\mathsf{CONGEST}$. In particular, we show that randomness does not help in this setting, and the complexity class $\Theta(\log \log n)$ does not exist (while it does exist in the broader setting of general trees). We also show how to systematically determine the complexity class of any such problem $\Pi$, i.e., whether $\Pi$ takes $O(1)$, $\Theta(\log^* n)$, $\Theta(\log n)$, or $n^{\Theta(1)}$ rounds. While the algorithm may take exponential time in the size of the description of $\Pi$, it is nevertheless practical: we provide a freely available implementation of the classifier algorithm, and it is fast enough to classify many problems of interest.
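The formalism above can be illustrated with a small sketch: a problem is a table from each label to its permitted child-label combinations, and a labeling is valid if every internal node's children match the table. The encoding below (sorted tuples as multisets, the helper names `coloring_problem` and `is_valid`, and the example tree) is an assumption for illustration and is not taken from the authors' classifier implementation.

```python
from itertools import combinations_with_replacement


def coloring_problem(num_colors, arity):
    """Constraint table for proper vertex coloring in rooted trees.

    Maps each label x to the set of permitted child-label multisets
    (stored as sorted tuples): no child may share the label x.
    """
    labels = range(1, num_colors + 1)
    return {
        x: {c for c in combinations_with_replacement(labels, arity) if x not in c}
        for x in labels
    }


def is_valid(problem, tree, labeling, root):
    """Check a labeling of a rooted tree against the constraint table.

    tree maps each node to its list of children; leaves (empty list)
    are unconstrained, as in the formalism.
    """
    stack = [root]
    while stack:
        v = stack.pop()
        children = tree[v]
        if children:
            combo = tuple(sorted(labeling[c] for c in children))
            if combo not in problem[labeling[v]]:
                return False
            stack.extend(children)
    return True


three_coloring = coloring_problem(3, 2)  # 3 colors, binary trees
tree = {"r": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}
good = {"r": 1, "a": 2, "b": 3, "c": 1, "d": 3}
bad = {"r": 1, "a": 1, "b": 3, "c": 2, "d": 3}  # root and child share label 1
```

Here `is_valid(three_coloring, tree, good, "r")` accepts the first labeling and rejects the second. Verifying a labeling is the easy direction; the classification question in the abstract concerns how many communication rounds are needed to produce one.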
In the past few years, many interesting inapproximability results have been obtained from the parameterized perspective. This article surveys some of these results, with a focus on $k$-Clique, $k$-SetCover, and other related problems.
Online learning algorithms have become a ubiquitous tool in the machine learning toolbox and are frequently used in small, resource-constrained environments. Among the most successful online learning methods are Decision Tree (DT) ensembles. DT ensembles provide excellent performance while adapting to changes in the data, but they are not resource efficient. Incremental tree learners keep adding new nodes to the tree but never remove old ones, increasing the memory consumption over time. Gradient-based tree learning, on the other hand, requires the computation of gradients over the entire tree, which is costly for even moderately sized trees. In this paper, we propose a novel memory-efficient online classification ensemble called shrub ensembles (SE) for resource-constrained systems. Our algorithm trains small to medium-sized decision trees on small windows and uses stochastic proximal gradient descent to learn the ensemble weights of these 'shrubs'. We provide a theoretical analysis of our algorithm and include an extensive discussion on the behavior of our approach in the online setting. In a series of 2,959 experiments on 12 different datasets, we compare our method against 8 state-of-the-art methods. Our shrub ensembles retain excellent performance even when only little memory is available. We show that SE offers a better accuracy-memory trade-off in 7 of 12 cases, while performing statistically significantly better than most other methods. Our implementation is available under //github.com/sbuschjaeger/se-online .
In the online bin packing problem, a sequence of items is revealed one at a time, and each item must be packed into an available bin instantly upon its arrival. In this paper, we revisit the problem under a setting where the total number of items $T$ is known in advance, also known as the closed online bin packing problem. Specifically, we study both the stochastic model and the random permutation model. We develop and analyze an adaptive algorithm that solves an offline bin packing problem at geometric time intervals and uses the offline optimal solution to guide online packing decisions. Under both models, we show that the algorithm achieves $C\sqrt{T}$ regret (in terms of the number of used bins) compared to the hindsight optimal solution, where $C$ is a universal constant ($C \le 13$) that bears no dependence on the underlying distribution or the item sizes. The result shows that the lower bound barrier of $\Omega(\sqrt{T \log T})$ in Shor (1986) can be broken with knowledge of the horizon $T$ alone, without needing to know the underlying distribution. As to the algorithm analysis, we develop a new approach to analyzing the packing dynamics using the notion of exchangeable random variables. The approach creates a symmetrization between the offline solution and the online solution, and it is used to analyze both the algorithm performance and various benchmarks related to the bin packing problem. For the latter, our analysis provides an alternative (probably simpler) treatment and tightens the analysis of the asymptotic benchmark of the stochastic bin packing problem in Rhee and Talagrand (1989a,b). As the analysis only relies on a symmetry between the offline and online problems, the algorithm and benchmark analyses can be naturally extended from the stochastic model to the random permutation model.
Since conventional approaches cannot adapt to dynamic traffic conditions, reinforcement learning (RL) has attracted increasing attention as a way to solve the traffic signal control (TSC) problem. However, existing RL-based methods are rarely deployed, because they are neither cost-effective in terms of computing resources nor more robust than traditional approaches. This raises a critical research question: how can an adaptive controller for TSC be constructed with less training and reduced complexity using an RL-based approach? To address this question, in this paper, we (1) specify the traffic movement representation as a simple but efficient pressure of vehicle queues in a traffic network, namely efficient pressure (EP); (2) build a traffic signal settings protocol, including phase duration, signal phase number and EP for TSC; (3) design a TSC approach based on the traditional max pressure (MP) approach, namely efficient max pressure (Efficient-MP), using the EP to capture the traffic state; and (4) develop a general RL-based TSC algorithm template: efficient Xlight (Efficient-XLight) under EP. Through comprehensive experiments on multiple real-world datasets under our traffic signal settings protocol for TSC, we demonstrate that efficient pressure is complementary to traditional and RL-based modeling to design better TSC methods. Our code is released on GitHub.
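The pressure-based control rule mentioned above can be sketched as follows. This is the textbook max pressure idea (serve the phase whose movements have the largest upstream-minus-downstream queue difference), not the paper's Efficient-MP; the exact definition of EP is not given in the abstract, so the plain queue-length difference, the lane names and the queue values below are all assumptions.

```python
def movement_pressure(queues, movement):
    """Queue on the incoming lane minus queue on the outgoing lane."""
    incoming, outgoing = movement
    return queues[incoming] - queues[outgoing]


def max_pressure_phase(queues, phases):
    """Return the phase whose permitted movements have the largest total pressure."""
    def phase_pressure(name):
        return sum(movement_pressure(queues, m) for m in phases[name])

    return max(phases, key=phase_pressure)


# Hypothetical queue lengths (vehicles) at one intersection
queues = {
    "N_in": 8, "S_in": 5, "E_in": 2, "W_in": 1,
    "N_out": 1, "S_out": 0, "E_out": 3, "W_out": 2,
}
# Each phase lists the (incoming lane, outgoing lane) movements it permits
phases = {
    "north_south": [("N_in", "S_out"), ("S_in", "N_out")],
    "east_west": [("E_in", "W_out"), ("W_in", "E_out")],
}
chosen = max_pressure_phase(queues, phases)
```

With these numbers the north-south phase has pressure 12 and the east-west phase has pressure -2, so the rule selects `north_south`. A rule of this form needs no training, which is why it serves as a natural baseline and state representation for RL-based controllers.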
Dynamic treatment regimes (DTRs) consist of a sequence of decision rules, one per stage of intervention, that find effective treatments for individual patients according to patient information history. DTRs can be estimated from models which include the interaction between treatment and a small number of covariates which are often chosen a priori. However, with increasingly large and complex data being collected, it is difficult to know which prognostic factors might be relevant in the treatment rule. Therefore, a more data-driven approach to selecting these covariates might improve the estimated decision rules and simplify models to make them easier to interpret. We propose a variable selection method for DTR estimation using penalized dynamic weighted least squares. Our method has the strong heredity property, that is, an interaction term can be included in the model only if the corresponding main terms have also been selected. Through simulations, we show our method has both the double robustness property and the oracle property, and the newly proposed methods compare favorably with other variable selection approaches.
Traceless Genetic Programming (TGP) is a new Genetic Programming (GP) variant that may be used for solving difficult real-world problems. The main difference between TGP and other GP techniques is that TGP does not explicitly store the evolved computer programs. In this paper, TGP is used for solving real-world classification problems taken from PROBEN1. Numerical experiments show that TGP performs similarly to, and sometimes even better than, other GP techniques on the considered test problems.
The interpretation of Deep Neural Network (DNN) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by reformulating the training procedure from the trajectory optimization perspective. We first show that most widely-used algorithms for training DNNs can be linked to Differential Dynamic Programming (DDP), a celebrated second-order trajectory optimization algorithm rooted in Approximate Dynamic Programming. In this vein, we propose a new variant of DDP that can accept batch optimization for training feedforward networks, while integrating naturally with the recent progress in curvature approximation. The resulting algorithm features layer-wise feedback policies which improve the convergence rate and reduce sensitivity to hyper-parameters compared with existing methods. We show that the algorithm is competitive against state-of-the-art first- and second-order methods. Our work opens up new avenues for principled algorithmic design built upon optimal control theory.
Deep neural networks and decision trees operate on largely separate paradigms; typically, the former performs representation learning with pre-specified architectures, while the latter is characterised by learning hierarchies over pre-specified features with data-driven architectures. We unite the two via adaptive neural trees (ANTs), a model that incorporates representation learning into edges, routing functions and leaf nodes of a decision tree, along with a backpropagation-based training algorithm that adaptively grows the architecture from primitive modules (e.g., convolutional layers). ANTs allow increased interpretability via hierarchical clustering, e.g., learning meaningful class associations, such as separating natural vs. man-made objects. We demonstrate this on classification and regression tasks, achieving over 99% and 90% accuracy on the MNIST and CIFAR-10 datasets, and outperforming standard neural networks, random forests and gradient boosted trees on the SARCOS dataset. Furthermore, ANT optimisation naturally adapts the architecture to the size and complexity of the training data.
Accurately classifying malignancy of lesions detected in a screening scan plays a critical role in reducing false positives. Through extracting and analyzing a large number of quantitative image features, radiomics holds great potential to differentiate malignant tumors from benign ones. Since not all radiomic features contribute to an effective classification model, selecting an optimal feature subset is critical. This work proposes a new multi-objective based feature selection (MO-FS) algorithm that considers both sensitivity and specificity simultaneously as the objective functions during the feature selection. In MO-FS, we developed a modified entropy based termination criterion (METC) to stop the algorithm automatically rather than relying on a preset number of generations. We also designed a solution selection methodology for multi-objective learning using the evidential reasoning approach (SMOLER) to automatically select the optimal solution from the Pareto-optimal set. Furthermore, an adaptive mutation operation was developed to generate the mutation probability in MO-FS automatically. The MO-FS was evaluated for classifying lung nodule malignancy in low-dose CT and breast lesion malignancy in digital breast tomosynthesis. Compared with other commonly used feature selection methods, the experimental results for both lung nodule and breast lesion malignancy classification demonstrated that the feature set selected by MO-FS achieved better classification performance.
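The notion of a Pareto-optimal set over sensitivity and specificity can be sketched with a simple non-dominated filter. This is a generic illustration, not the MO-FS, METC or SMOLER procedures themselves; the function name `pareto_front` and the candidate scores are hypothetical values chosen only to show the mechanics.

```python
def pareto_front(candidates):
    """Return the candidates not dominated on (sensitivity, specificity).

    A candidate is dominated if another one is at least as good on both
    objectives and strictly better on at least one.
    """
    def dominates(a, b):
        return (a[0] >= b[0] and a[1] >= b[1]) and (a[0] > b[0] or a[1] > b[1])

    return {
        name: score
        for name, score in candidates.items()
        if not any(dominates(other, score) for other in candidates.values())
    }


# Hypothetical (sensitivity, specificity) scores of four feature subsets
candidates = {
    "subset_a": (0.90, 0.70),
    "subset_b": (0.80, 0.85),
    "subset_c": (0.75, 0.80),  # dominated by subset_b
    "subset_d": (0.90, 0.60),  # dominated by subset_a
}
front = pareto_front(candidates)
```

In this toy case the front contains `subset_a` and `subset_b`: neither is better on both objectives, so a further selection step is needed to pick a single solution, which is the role the abstract assigns to SMOLER.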