Decision trees are widely used classification and regression models because of their interpretability and good accuracy. Classical methods such as CART are based on greedy approaches, but growing attention has recently been devoted to optimal decision trees. We investigate the nonlinear continuous optimization formulation proposed in Blanquero et al. (EJOR, vol. 284, 2020; COR, vol. 132, 2021) for (sparse) optimal randomized classification trees. Sparsity is important not only for feature selection but also for improving interpretability. We first consider alternative methods to sparsify such trees based on concave approximations of the $l_{0}$ ``norm''. Promising results are obtained on 24 datasets in comparison with $l_1$ and $l_{\infty}$ regularizations. Then, we derive bounds on the VC dimension of multivariate randomized classification trees. Finally, since training is computationally challenging for large datasets, we propose a general decomposition scheme and an efficient version of it. Experiments on larger datasets show that the proposed decomposition method is able to significantly reduce training times without compromising accuracy.
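As a minimal sketch of what a concave approximation of the $l_0$ ``norm'' looks like, the snippet below uses the exponential surrogate $\sum_j (1 - e^{-\alpha |\beta_j|})$, one common choice in the sparse-optimization literature; the function name, the parameter $\alpha$, and the choice of surrogate are illustrative assumptions, not the specific approximation used in the paper.

```python
import math

def l0_exp_surrogate(beta, alpha=5.0):
    """One common concave approximation of the l0 'norm':
    sum_j (1 - exp(-alpha * |beta_j|)).
    As alpha grows, this tends to the exact count of nonzero entries,
    while staying continuous (and concave in |beta_j|), unlike the l0 count."""
    return sum(1.0 - math.exp(-alpha * abs(b)) for b in beta)

beta = [0.0, 0.5, -2.0, 0.0]
approx = l0_exp_surrogate(beta, alpha=50.0)
# With a large alpha the surrogate is very close to the true l0 value (2 here),
# whereas the l1 norm of the same vector would be 2.5.
```

Unlike the $l_1$ penalty, the surrogate penalizes a coefficient of $2.0$ and a coefficient of $0.5$ almost equally once $\alpha$ is large, which is precisely what drives exact zeros without shrinking large weights.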
We propose a new method for tackling classification problems: the multi-output neural tree (MONT) algorithm, an evolutionary learning algorithm trained by the non-dominated sorting genetic algorithm (NSGA)-III. Since evolutionary learning is stochastic, the hypothesis found in the form of a MONT is unique for each run of evolutionary learning, i.e., each generated hypothesis (tree) bears distinct properties compared to any other hypothesis in both topological space and parameter space. This leads to a challenging optimisation problem in which the aim is to minimise the tree size and maximise the classification accuracy; Pareto-optimality is therefore assessed through hypervolume indicator analysis. We used nine benchmark classification problems to evaluate the performance of MONT. In our experiments, MONTs tackled these problems with high accuracy, outperforming a set of well-known classifiers on the problems studied: multilayer perceptron, reduced-error pruning tree, naive Bayes classifier, decision tree, and support vector machine. Moreover, comparing three versions of MONT training, using genetic programming, NSGA-II, and NSGA-III, suggests that NSGA-III yields the best Pareto-optimal solutions.
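The core of NSGA-style training is non-dominated sorting of candidate trees under the two competing objectives (minimise tree size, maximise accuracy). The sketch below extracts the first Pareto front for that bi-objective setting; the function names and the example (size, accuracy) pairs are illustrative assumptions, not part of the MONT implementation.

```python
def dominates(a, b):
    """a dominates b when minimising tree size and maximising accuracy:
    a is no worse on both objectives and strictly better on at least one."""
    size_a, acc_a = a
    size_b, acc_b = b
    return (size_a <= size_b and acc_a >= acc_b) and (size_a < size_b or acc_a > acc_b)

def pareto_front(points):
    """First front of non-dominated sorting: the points no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

trees = [(10, 0.90), (15, 0.92), (12, 0.88), (20, 0.92)]
front = pareto_front(trees)
# (12, 0.88) is dominated by (10, 0.90); (20, 0.92) is dominated by (15, 0.92).
```

The surviving front makes the trade-off explicit: a smaller, slightly less accurate tree coexists with a larger, more accurate one, and the hypervolume indicator then scores how good such a front is as a whole.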
Let $X$ and $Y$ be two real-valued random variables. Let $(X_{1},Y_{1}),(X_{2},Y_{2}),\ldots$ be independent identically distributed copies of $(X,Y)$. Suppose there are two players A and B. Player A has access to $X_{1},X_{2},\ldots$ and player B has access to $Y_{1},Y_{2},\ldots$. Without communication, what joint probability distributions can players A and B jointly simulate? That is, if $k,m$ are fixed positive integers, what probability distributions on $\{1,\ldots,m\}^{2}$ are equal to the distribution of $(f(X_{1},\ldots,X_{k}),\,g(Y_{1},\ldots,Y_{k}))$ for some $f,g\colon\mathbb{R}^{k}\to\{1,\ldots,m\}$? When $X$ and $Y$ are standard Gaussians with fixed correlation $\rho\in(-1,1)$, we show that the set of probability distributions that can be noninteractively simulated from $k$ Gaussian samples is the same for any $k\geq m^{2}$. Previously, it was not even known if this number of samples $m^{2}$ would be finite or not, except when $m\leq 2$. Consequently, a straightforward brute-force search deciding whether or not a probability distribution on $\{1,\ldots,m\}^{2}$ is within distance $0<\epsilon<|\rho|$ of being noninteractively simulated from $k$ correlated Gaussian samples has run time bounded by $(5/\epsilon)^{m(\log(\epsilon/2) / \log|\rho|)^{m^{2}}}$, improving a bound of Ghazi, Kamath and Raghavendra. A nonlinear central limit theorem (i.e. invariance principle) of Mossel then generalizes this result to decide whether or not a probability distribution on $\{1,\ldots,m\}^{2}$ is within distance $0<\epsilon<|\rho|$ of being noninteractively simulated from $k$ samples of a given finite discrete distribution $(X,Y)$ in run time that does not depend on $k$, with constants that again improve a bound of Ghazi, Kamath and Raghavendra.
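The setup above can be illustrated with a Monte Carlo sketch: draw $k$ correlated Gaussian pairs, apply candidate functions $f$ and $g$, and tabulate the resulting joint distribution on $\{1,\ldots,m\}^2$. The sampling scheme $Y = \rho X + \sqrt{1-\rho^2}\,Z$ and the sign-based example functions are standard constructions, not part of the paper's algorithm.

```python
import math, random

def simulate_joint(f, g, k, rho, n_trials=20000, seed=0):
    """Monte Carlo estimate of the distribution of (f(X_1..X_k), g(Y_1..Y_k)),
    where (X_i, Y_i) are i.i.d. standard Gaussian pairs with correlation rho,
    generated via Y = rho*X + sqrt(1 - rho^2)*Z for an independent Gaussian Z."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_trials):
        xs, ys = [], []
        for _ in range(k):
            x = rng.gauss(0.0, 1.0)
            y = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
            xs.append(x)
            ys.append(y)
        key = (f(xs), g(ys))
        counts[key] = counts.get(key, 0) + 1
    return {key: c / n_trials for key, c in counts.items()}

# m = 2 outputs via the sign of the first sample; positive correlation makes
# agreement between the two players' outputs more likely than disagreement.
dist = simulate_joint(lambda xs: int(xs[0] > 0), lambda ys: int(ys[0] > 0),
                      k=1, rho=0.9)
```

The brute-force search in the abstract enumerates (a discretization of) such $f, g$ pairs and checks whether any of the resulting joint distributions lands within $\epsilon$ of the target.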
Preferential sampling provides a formal modeling specification to capture the effect of bias in a set of sampling locations on inference when a geostatistical model is used to explain observed responses at the sampled locations. In particular, it enables modification of spatial prediction adjusted for the bias. Its original presentation in the literature addressed assessment of the presence of such sampling bias, while follow-on work focused on regression specification to improve spatial interpolation under such bias. All of the work in the literature to date considers the case of a univariate response variable at each location, either continuous or modeled through a latent continuous variable. The contribution here is to extend the notion of preferential sampling to the case of a bivariate response at each location. This exposes sampling scenarios where both responses are observed at a given location as well as scenarios where, for some locations, only one of the responses is recorded. That is, there may be different sampling bias for one response than for the other. It leads to assessing the impact of such bias on co-kriging. It also exposes the possibility that preferential sampling can bias inference regarding dependence between responses at a location. We develop the idea of bivariate preferential sampling through various model specifications and illustrate the effect of these specifications on prediction and dependence behavior. We do this both through simulation examples and with a forestry dataset that provides mean diameter at breast height (MDBH) and trees per hectare (TPH) as the point-referenced bivariate responses.
Ensemble methods based on subsampling, such as random forests, are popular in applications due to their high predictive accuracy. Existing literature views a random forest prediction as an infinite-order incomplete U-statistic to quantify its uncertainty. However, these methods focus on a small subsampling size of each tree, which is theoretically valid but practically limited. This paper develops an unbiased variance estimator based on incomplete U-statistics, which allows the tree size to be comparable with the overall sample size, making statistical inference possible in a broader range of real applications. Simulation results demonstrate that our estimators enjoy lower bias and more accurate confidence interval coverage without additional computational costs. We also propose a local smoothing procedure to reduce the variation of our estimator, which shows improved numerical performance when the number of trees is relatively small. Further, we investigate the ratio consistency of our proposed variance estimator under specific scenarios. In particular, we develop a new "double U-statistic" formulation to analyze the Hoeffding decomposition of the estimator's variance.
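To make the U-statistic view concrete, the sketch below computes an incomplete U-statistic: the average of a degree-$k$ kernel over a random subset of the $\binom{n}{k}$ possible subsamples, which is exactly how a subsampled ensemble averages per-tree predictions. The function name and the toy kernel (the subsample mean) are illustrative assumptions, not the paper's estimator.

```python
import random

def incomplete_u_statistic(data, kernel, k, n_subsamples, seed=0):
    """Incomplete U-statistic: average a degree-k kernel over randomly drawn
    size-k subsamples instead of all C(n, k) of them. A random-forest
    prediction has this form, with 'kernel' the prediction of one tree
    trained on the subsample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_subsamples):
        total += kernel(rng.sample(data, k))
    return total / n_subsamples

# With the subsample mean as kernel, the incomplete U-statistic estimates
# the population mean; here k = 3 is comparable to n = 10, the regime the
# abstract argues matters in practice.
data = list(range(10))  # mean 4.5
est = incomplete_u_statistic(data, kernel=lambda s: sum(s) / len(s),
                             k=3, n_subsamples=2000)
```

The variance of such an estimator has two sources, the randomness of the data and the randomness of the chosen subsamples, and separating these is what the Hoeffding-decomposition analysis in the abstract addresses.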
Finite order Markov models are theoretically well-studied models for dependent categorical data. Despite their generality, application in empirical work when the order is larger than one is quite rare. Practitioners avoid higher order Markov models because (1) the number of parameters grows exponentially with the order, and (2) the interpretation is often difficult. Mixture of transition distribution (MTD) models were introduced to overcome both limitations. MTD models represent higher order Markov chains as a convex mixture of single-step Markov chains, reducing the number of parameters and increasing interpretability. Nevertheless, in practice, estimation of MTD models with large orders is still limited because of the curse of dimensionality and high algorithmic complexity. Here, we prove that if only a few lags are relevant, we can consistently and efficiently recover the lags and estimate the transition probabilities of high order MTD models. The key innovation is a recursive procedure for the selection of the relevant lags of the model. Our results are based on (1) a new structural result for MTD models and (2) an improved martingale concentration inequality. Our theoretical results are illustrated through simulations.
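A minimal sketch of the MTD structure: the next-state distribution is a convex mixture of single-lag transition rows, $P(X_t = \cdot \mid \text{past}) = \sum_g \lambda_g\, Q_g(x_{t-g}, \cdot)$. The function name and the toy order-2 matrices are illustrative assumptions.

```python
def mtd_transition(past, lambdas, Qs):
    """Mixture-of-transition-distributions probability vector:
    P(X_t = . | past) = sum_g lambda_g * Q_g[past[-g]], where Q_g is the
    single-lag transition matrix for lag g (rows sum to one) and the
    lambda_g are nonnegative weights summing to one."""
    n_states = len(Qs[0][0])
    prob = [0.0] * n_states
    for g, (lam, Q) in enumerate(zip(lambdas, Qs), start=1):
        row = Q[past[-g]]
        for j in range(n_states):
            prob[j] += lam * row[j]
    return prob

# Order-2 MTD on two states: lag 1 has weight 0.7, lag 2 has weight 0.3.
Q1 = [[0.9, 0.1], [0.2, 0.8]]
Q2 = [[0.5, 0.5], [0.4, 0.6]]
p = mtd_transition(past=[0, 1], lambdas=[0.7, 0.3], Qs=[Q1, Q2])
# p is a valid distribution because each mixed row is.
```

The parameter saving is visible even here: a full order-2 chain on $s$ states needs $s^2(s-1)$ free parameters, while the MTD form needs only the two $s \times s$ matrices plus the mixture weights. A lag $g$ with $\lambda_g \approx 0$ is irrelevant, which is what the lag-selection procedure in the abstract exploits.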
Let $\mathbb{Z}_n = \{Z_1, \ldots, Z_n\}$ be a design; that is, a collection of $n$ points $Z_j \in [-1,1]^d$. We study the quality of quantization of $[-1,1]^d$ by the points of $\mathbb{Z}_n$ and the problem of quality of coverage of $[-1,1]^d$ by ${\cal B}_d(\mathbb{Z}_n,r)$, the union of balls centred at $Z_j \in \mathbb{Z}_n$. We concentrate on the cases where the dimension $d$ is not small ($d\geq 5$) and $n$ is not too large, $n \leq 2^d$. We define ${\mathbb{D}_{n,\delta}}$ as a $2^{d-1}$-point design on the vertices of the cube $[-\delta,\delta]^d$, $0\leq \delta\leq 1$. For this design, we derive a closed-form expression for the quantization error and very accurate approximations for the coverage area vol$([-1,1]^d \cap {\cal B}_d(\mathbb{Z}_n,r))$. We provide results of a large-scale numerical investigation confirming the accuracy of the developed approximations and the efficiency of the designs ${\mathbb{D}_{n,\delta}}$.
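A minimal sketch of such a design and its quantization error: the code takes half of the $2^d$ vertices of $[-\delta,\delta]^d$ and estimates $\mathbb{E}\min_j \|U - Z_j\|^2$ for $U$ uniform on $[-1,1]^d$ by Monte Carlo. Selecting the vertices with an even number of negative coordinates is an illustrative assumption about which half-design is meant, and the Monte Carlo estimate is a numerical stand-in for the closed-form expression derived in the paper.

```python
import itertools, random

def design_d(d, delta):
    """A 2^(d-1)-point design on the vertices of [-delta, delta]^d: here we
    keep the vertices with an even number of negative coordinates (an
    assumed, illustrative choice of half-design)."""
    pts = []
    for signs in itertools.product((-1.0, 1.0), repeat=d):
        if signs.count(-1.0) % 2 == 0:
            pts.append(tuple(s * delta for s in signs))
    return pts

def mc_quantization_error(design, d, n_samples=2000, seed=0):
    """Monte Carlo estimate of E min_j ||U - Z_j||^2 for U uniform on [-1,1]^d."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        u = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        total += min(sum((ui - zi) ** 2 for ui, zi in zip(u, z)) for z in design)
    return total / n_samples

Z = design_d(d=5, delta=0.5)  # 2^(5-1) = 16 points, n <= 2^d as required
err = mc_quantization_error(Z, d=5)
```

Sweeping $\delta$ in such a sketch shows the trade-off the paper optimizes: $\delta = 0$ collapses the design to the centre, while $\delta = 1$ pushes all points into the corners of the cube.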
For large classes of group testing problems, we derive lower bounds for the probability that all significant items are uniquely identified using specially constructed random designs. These bounds allow us to optimize parameters of the randomization schemes. We also suggest and numerically justify a procedure of constructing designs with better separability properties than pure random designs. We illustrate theoretical considerations with a large simulation-based study. This study indicates, in particular, that in the case of the common binary group testing, the suggested families of designs have better separability than the popular designs constructed from disjunct matrices. We also derive several asymptotic expansions and discuss the situations when the resulting approximations achieve high accuracy.
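A minimal sketch of the random designs discussed above, for the common binary group testing setting: each item joins each test independently with probability $p$, a test is positive iff it contains a defective item, and a design separates two candidate defective sets when they produce different outcome vectors. The function names and parameters are illustrative assumptions, not the paper's constructions.

```python
import random

def bernoulli_design(n_tests, n_items, p, seed=0):
    """Random binary design: item j is included in test i with probability p,
    independently across entries."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for _ in range(n_items)]
            for _ in range(n_tests)]

def outcomes(design, defectives):
    """Binary group-testing outcomes: a test is positive iff it contains
    at least one defective item."""
    return [int(any(row[j] for j in defectives)) for row in design]

def separates(design, set_a, set_b):
    """Two candidate defective sets are separated by the design if they
    yield different outcome vectors."""
    return outcomes(design, set_a) != outcomes(design, set_b)
```

Estimating the probability that all pairs of plausible defective sets are separated, over many random draws of the design, is the quantity the lower bounds in the abstract control, and tuning $p$ is an instance of optimizing the randomization scheme.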
Several recent applications of optimal transport (OT) theory to machine learning have relied on regularization, notably entropy and the Sinkhorn algorithm. Because matrix-vector products are pervasive in the Sinkhorn algorithm, several works have proposed to \textit{approximate} kernel matrices appearing in its iterations using low-rank factors. Another route lies instead in imposing low-rank constraints on the feasible set of couplings considered in OT problems, with no approximation of cost or kernel matrices. This route was first explored by Forrow et al., 2018, who proposed an algorithm tailored for the squared Euclidean ground cost, using a proxy objective that can be solved through the machinery of regularized 2-Wasserstein barycenters. Building on this, we introduce in this work a generic approach that aims at solving, in full generality, the OT problem under low-rank constraints with arbitrary costs. Our algorithm relies on an explicit factorization of low-rank couplings as a product of \textit{sub-coupling} factors linked by a common marginal; similar to an NMF approach, we alternately update these factors. We prove the non-asymptotic stationary convergence of this algorithm and illustrate its efficiency on benchmark experiments.
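A minimal sketch of the factorization just described: a rank-$r$ coupling assembled as $P = Q\,\mathrm{diag}(1/g)\,R^\top$ from sub-couplings $Q$ and $R$ that share the common inner marginal $g$. The function name and the toy rank-2 example are illustrative assumptions; the actual algorithm alternately updates $Q$, $R$, and $g$ under marginal constraints.

```python
def low_rank_coupling(Q, R, g):
    """Assemble P = Q diag(1/g) R^T from sub-couplings Q (n x r) and R (m x r)
    sharing the common inner marginal g (length r): the column sums of Q and
    of R both equal g, so the row sums of P recover Q's row marginal and the
    column sums of P recover R's row marginal."""
    n, m, r = len(Q), len(R), len(g)
    return [[sum(Q[i][k] * R[j][k] / g[k] for k in range(r))
             for j in range(m)] for i in range(n)]

# Tiny rank-2 example with uniform outer marginals a = b = (0.5, 0.5).
Q = [[0.3, 0.2], [0.2, 0.3]]  # columns sum to g
R = [[0.5, 0.0], [0.0, 0.5]]  # columns sum to g
g = [0.5, 0.5]
P = low_rank_coupling(Q, R, g)
# Row sums and column sums of P both equal (0.5, 0.5), so P is a coupling.
```

The payoff of the factorization is computational: storing and manipulating $Q$, $R$, and $g$ costs $O((n+m)r)$ rather than the $O(nm)$ of a full coupling matrix.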
Network embedding aims to learn low-dimensional representations of nodes in a network, while the network structure and inherent properties are preserved. It has attracted tremendous attention recently due to significant progress in downstream network learning tasks, such as node classification, link prediction, and visualization. However, most existing network embedding methods suffer from expensive computation on large networks. In this paper, we propose a $10\times$--$100\times$ faster network embedding method, called Progle, by elegantly utilizing the sparsity property of online networks and spectral analysis. In Progle, we first construct a \textit{sparse} proximity matrix and train the network embedding efficiently via sparse matrix decomposition. Then we introduce a network propagation pattern via spectral analysis to incorporate local and global structure information into the embedding. Moreover, the model can be generalized to efficiently integrate network information into other insufficiently trained embeddings. Benefiting from sparse spectral network embedding, our experiments on four different datasets show that Progle outperforms or is comparable to state-of-the-art unsupervised comparison approaches---DeepWalk, LINE, node2vec, GraRep, and HOPE---regarding accuracy, while being $10\times$ faster than the fastest word2vec-based method. Finally, we validate the scalability of Progle both on real large-scale networks and on synthetic networks of multiple scales.
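To illustrate why sparsity makes spectral embedding cheap, the sketch below runs power iteration on a sparsely stored adjacency matrix: each matrix-vector product costs time proportional to the number of edges, not $n^2$. This is a minimal stand-in for Progle's sparse matrix decomposition (which computes many components, not one), and the dict-of-dicts storage and toy path graph are illustrative assumptions.

```python
def sparse_matvec(rows, x):
    """Multiply a sparse matrix (dict-of-dicts rows) by a dense vector:
    cost is proportional to the number of stored nonzeros."""
    return [sum(v * x[j] for j, v in rows[i].items()) for i in range(len(rows))]

def power_iteration(rows, n_iter=100):
    """Leading eigenvector of a sparse symmetric matrix by power iteration;
    a one-component stand-in for a sparse spectral decomposition."""
    n = len(rows)
    x = [1.0] * n
    for _ in range(n_iter):
        y = sparse_matvec(rows, x)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x

# Path graph on 4 nodes, stored sparsely: only 6 nonzeros instead of 16.
A = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
v = power_iteration(A)
# By symmetry of the path, v[0] == v[3] and v[1] == v[2], with the interior
# nodes carrying more weight.
```

The spectral propagation step in the abstract plays a complementary role: applying a polynomial filter of such sparse products to an embedding mixes in multi-hop (global) structure at the same per-edge cost.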
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds on techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.
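The objective being surrogated can be sketched directly: penalize the empirical mean loss by the standard error of the losses, $\bar{\ell} + c\sqrt{\widehat{\mathrm{Var}}(\ell)/n}$. This mean-plus-deviation criterion is nonconvex in the model parameters in general; the paper's contribution is a convex surrogate for it via distributionally robust optimization. The function name and constant $c$ below are illustrative assumptions.

```python
import math

def variance_regularized_risk(losses, c=1.0):
    """Mean-plus-standard-error criterion on a sample of per-example losses:
    mean(losses) + c * sqrt(var(losses) / n). Minimizing this trades average
    training loss against its sampling variability."""
    n = len(losses)
    mean = sum(losses) / n
    var = sum((l - mean) ** 2 for l in losses) / n
    return mean + c * math.sqrt(var / n)

# Two predictors with the same mean loss: the one with steadier per-example
# losses gets the smaller variance-regularized risk.
risk_steady = variance_regularized_risk([0.4, 0.4, 0.4, 0.4])
risk_spiky = variance_regularized_risk([0.1, 0.9, 0.2, 0.4])
```

Under this criterion the steadier predictor wins even though plain empirical risk minimization would see the two as tied, which is the automatic bias-variance balancing the abstract describes.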