In this article we propose a boosting algorithm for regression with functional explanatory variables and scalar responses. The algorithm uses decision trees constructed with multiple projections as the "base-learners", which we call "functional multi-index trees". We establish identifiability conditions for these trees and introduce two algorithms to compute them: one finds optimal projections over the entire tree, while the other one searches for a single optimal projection at each split. We use numerical experiments to investigate the performance of our method and compare it with several linear and nonlinear regression estimators, including recently proposed nonparametric and semiparametric functional additive estimators. Simulation studies show that the proposed method is consistently among the top performers, whereas the performance of any competitor relative to others can vary substantially across different settings. In a real example, we apply our method to predict electricity demand using price curves and show that our estimator provides better predictions compared to its competitors, especially when one adjusts for seasonality.
We consider the "all-for-one" decentralized learning problem for generalized linear models. The features of each sample are partitioned among several collaborating agents in a connected network, but only one agent observes the response variables. To solve the regularized empirical risk minimization in this distributed setting, we apply the Chambolle--Pock primal--dual algorithm to an equivalent saddle-point formulation of the problem. The primal and dual iterations are either in closed-form or reduce to coordinate-wise minimization of scalar convex functions. We establish convergence rates for the empirical risk minimization under two different assumptions on the loss function (Lipschitz and square root Lipschitz), and show how they depend on the characteristics of the design matrix and the Laplacian of the network.
In data analysis problems where we are not able to rely on distributional assumptions, what types of inference guarantees can still be obtained? Many popular methods, such as holdout methods, cross-validation methods, and conformal prediction, are able to provide distribution-free guarantees for predictive inference, but the problem of providing inference for the underlying regression function (for example, inference on the conditional mean $\mathbb{E}[Y|X]$) is more challenging. In the setting where the features $X$ are continuously distributed, recent work has established that any confidence interval for $\mathbb{E}[Y|X]$ must have non-vanishing width, even as sample size tends to infinity. At the other extreme, if $X$ takes only a small number of possible values, then inference on $\mathbb{E}[Y|X]$ is trivial to achieve. In this work, we study the problem in settings in between these two extremes. We find that there are several distinct regimes in between the finite setting and the continuous setting, where vanishing-width confidence intervals are achievable if and only if the effective support size of the distribution of $X$ is smaller than the square of the sample size.
The design of a metric between probability distributions is a longstanding problem motivated by numerous applications in Machine Learning. Focusing on continuous probability distributions on the Euclidean space $\mathbb{R}^d$, we introduce a novel pseudo-metric between probability distributions by leveraging the extension of univariate quantiles to multivariate spaces. Data depth is a nonparametric statistical tool that measures the centrality of any element $x\in\mathbb{R}^d$ with respect to (w.r.t.) a probability distribution or a data set. It is a natural median-oriented extension of the cumulative distribution function (cdf) to the multivariate case. Thus, its upper-level sets -- the depth-trimmed regions -- give rise to a definition of multivariate quantiles. The new pseudo-metric relies on the average of the Hausdorff distance between the depth-based quantile regions w.r.t. each distribution. Its good behavior w.r.t. major transformation groups, as well as its ability to factor out translations, are depicted. Robustness, an appealing feature of this pseudo-metric, is studied through the finite sample breakdown point. Moreover, we propose an efficient approximation method with linear time complexity w.r.t. the size of the data set and its dimension. The quality of this approximation as well as the performance of the proposed approach are illustrated in numerical experiments.
There are proposals that extend the classical generalized additive models (GAMs) to accommodate high-dimensional data ($p>>n$) using group sparse regularization. However, the sparse regularization may induce excess shrinkage when estimating smoothing functions, damaging predictive performance. Moreover, most of these GAMs consider an "all-in-all-out" approach for functional selection, rendering them difficult to answer if nonlinear effects are necessary. While some Bayesian models can address these shortcomings, using Markov chain Monte Carlo algorithms for model fitting creates a new challenge, scalability. Hence, we propose Bayesian hierarchical generalized additive models as a solution: we consider the smoothing penalty for proper shrinkage of curve interpolation and separation of smoothing function linear and nonlinear spaces. A novel spike-and-slab spline prior is proposed to select components of smoothing functions. Two scalable and deterministic algorithms, EM-Coordinate Descent and EM-Iterative Weighted Least Squares, are developed for different utilities. Simulation studies and metabolomics data analyses demonstrate improved predictive or computational performance against state-of-the-art models, mgcv, COSSO and sparse Bayesian GAM. The software implementation of the proposed models is freely available via an R package BHAM.
We develop a dimension reduction framework for data consisting of matrices of counts. Our model is based on assuming the existence of a small amount of independent normal latent variables that drive the dependency structure of the observed data, and can be seen as the exact discrete analogue for a contaminated low-rank matrix normal model. We derive estimators for the model parameters and establish their root-$n$ consistency. An extension of a recent proposal from the literature is used to estimate the latent dimension of the model. Additionally, a sparsity-accommodating variant of the model is considered. The method is shown to surpass both its vectorization-based competitors and matrix methods assuming the continuity of the data distribution in analysing simulated data and real abundance data.
Though data augmentation has rapidly emerged as a key tool for optimization in modern machine learning, a clear picture of how augmentation schedules affect optimization and interact with optimization hyperparameters such as learning rate is nascent. In the spirit of classical convex optimization and recent work on implicit bias, the present work analyzes the effect of augmentation on optimization in the simple convex setting of linear regression with MSE loss. We find joint schedules for learning rate and data augmentation scheme under which augmented gradient descent provably converges and characterize the resulting minimum. Our results apply to arbitrary augmentation schemes, revealing complex interactions between learning rates and augmentations even in the convex setting. Our approach interprets augmented (S)GD as a stochastic optimization method for a time-varying sequence of proxy losses. This gives a unified way to analyze learning rate, batch size, and augmentations ranging from additive noise to random projections. From this perspective, our results, which also give rates of convergence, can be viewed as Monro-Robbins type conditions for augmented (S)GD.
Automated decision-making tools increasingly assess individuals to determine if they qualify for high-stakes opportunities. A recent line of research investigates how strategic agents may respond to such scoring tools to receive favorable assessments. While prior work has focused on the short-term strategic interactions between a decision-making institution (modeled as a principal) and individual decision-subjects (modeled as agents), we investigate interactions spanning multiple time-steps. In particular, we consider settings in which the agent's effort investment today can accumulate over time in the form of an internal state - impacting both his future rewards and that of the principal. We characterize the Stackelberg equilibrium of the resulting game and provide novel algorithms for computing it. Our analysis reveals several intriguing insights about the role of multiple interactions in shaping the game's outcome: First, we establish that in our stateful setting, the class of all linear assessment policies remains as powerful as the larger class of all monotonic assessment policies. While recovering the principal's optimal policy requires solving a non-convex optimization problem, we provide polynomial-time algorithms for recovering both the principal and agent's optimal policies under common assumptions about the process by which effort investments convert to observable features. Most importantly, we show that with multiple rounds of interaction at her disposal, the principal is more effective at incentivizing the agent to accumulate effort in her desired direction. Our work addresses several critical gaps in the growing literature on the societal impacts of automated decision-making - by focusing on longer time horizons and accounting for the compounding nature of decisions individuals receive over time.
In the first part of this work, we develop a novel scheme for solving nonparametric regression problems. That is the approximation of possibly low regular and noised functions from the knowledge of their approximate values given at some random points. Our proposed scheme is based on the use of the pseudo-inverse of a random projection matrix, combined with some specific properties of the Jacobi polynomials system, as well as some properties of positive definite random matrices. This scheme has the advantages to be stable, robust, accurate and fairly fast in terms of execution time. In particular, we provide an $L_2$ as well as an $L_2-$risk errors of our proposed nonparametric regression estimator. Moreover and unlike most of the existing nonparametric regression estimators, no extra regularization step is required by our proposed estimator. Although, this estimator is initially designed to work with random sampling set of uni-variate i.i.d. random variables following a Beta distribution, we show that it is still works for a wide range of sampling distribution laws. Moreover, we briefly describe how our estimator can be adapted in order to handle the multivariate case of random sampling sets. In the second part of this work, we extend the random pseudo-inverse scheme technique to build a stable and accurate estimator for solving linear functional regression (LFR) problems. A dyadic decomposition approach is used to construct this last stable estimator for the LFR problem. Alaso, we give an $L_2-$risk error of our proposed LFR estimator. Finally, the performance of the two proposed estimators are illustrated by various numerical simulations. In particular, a real dataset is used to illustrate the performance of our nonparametric regression estimator.
In this paper several related estimation problems are addressed from a Bayesian point of view and optimal estimators are obtained for each of them when some natural loss functions are considered. Namely, we are interested in estimating a regression curve. Simultaneously, the estimation problems of a conditional distribution function, or a conditional density, or even the conditional distribution itself, are considered. All these problems are posed in a sufficiently general framework to cover continuous and discrete, univariate and multivariate, parametric and non-parametric cases, without the need to use a specific prior distribution. The loss functions considered come naturally from the quadratic error loss function comonly used in estimating a real function of the unknown parameter. The cornerstone of the mentioned Bayes estimators is the posterior predictive distribution. Some examples are provided to illustrate these results.
In this paper, we investigate the challenges of using reinforcement learning agents for question-answering over knowledge graphs for real-world applications. We examine the performance metrics used by state-of-the-art systems and determine that they are inadequate for such settings. More specifically, they do not evaluate the systems correctly for situations when there is no answer available and thus agents optimized for these metrics are poor at modeling confidence. We introduce a simple new performance metric for evaluating question-answering agents that is more representative of practical usage conditions, and optimize for this metric by extending the binary reward structure used in prior work to a ternary reward structure which also rewards an agent for not answering a question rather than giving an incorrect answer. We show that this can drastically improve the precision of answered questions while only not answering a limited number of previously correctly answered questions. Employing a supervised learning strategy using depth-first-search paths to bootstrap the reinforcement learning algorithm further improves performance.