Partially linear additive models generalize the linear models since they model the relation between a response variable and covariates by assuming that some covariates are supposed to have a linear relation with the response but each of the others enter with unknown univariate smooth functions. The harmful effect of outliers either in the residuals or in the covariates involved in the linear component has been described in the situation of partially linear models, that is, when only one nonparametric component is involved in the model. When dealing with additive components, the problem of providing reliable estimators when atypical data arise, is of practical importance motivating the need of robust procedures. Hence, we propose a family of robust estimators for partially linear additive models by combining $B-$splines with robust linear regression estimators. We obtain consistency results, rates of convergence and asymptotic normality for the linear components, under mild assumptions. A Monte Carlo study is carried out to compare the performance of the robust proposal with its classical counterpart under different models and contamination schemes. The numerical experiments show the advantage of the proposed methodology for finite samples. We also illustrate the usefulness of the proposed approach on a real data set.
In this paper, we consider a class of possibly nonconvex, nonsmooth and non-Lipschitz optimization problems arising in many contemporary applications such as machine learning, variable selection and image processing. To solve this class of problems, we propose a proximal gradient method with extrapolation and line search (PGels). This method is developed based on a special potential function and successfully incorporates both extrapolation and non-monotone line search, which are two simple and efficient accelerating techniques for the proximal gradient method. Thanks to the line search, this method allows more flexibilities in choosing the extrapolation parameters and updates them adaptively at each iteration if a certain line search criterion is not satisfied. Moreover, with proper choices of parameters, our PGels reduces to many existing algorithms. We also show that, under some mild conditions, our line search criterion is well defined and any cluster point of the sequence generated by PGels is a stationary point of our problem. In addition, by assuming the Kurdyka-${\L}$ojasiewicz exponent of the objective in our problem, we further analyze the local convergence rate of two special cases of PGels, including the widely used non-monotone proximal gradient method as one case. Finally, we conduct some numerical experiments for solving the $\ell_1$ regularized logistic regression problem and the $\ell_{1\text{-}2}$ regularized least squares problem. Our numerical results illustrate the efficiency of PGels and show the potential advantage of combining two accelerating techniques.
We consider the Sparse Principal Component Analysis (SPCA) problem under the well-known spiked covariance model. Recent work has shown that the SPCA problem can be reformulated as a Mixed Integer Program (MIP) and can be solved to global optimality, leading to estimators that are known to enjoy optimal statistical properties. However, current MIP algorithms for SPCA are unable to scale beyond instances with a thousand features or so. In this paper, we propose a new estimator for SPCA which can be formulated as a MIP. Different from earlier work, we make use of the underlying spiked covariance model and properties of the multivariate Gaussian distribution to arrive at our estimator. We establish statistical guarantees for our proposed estimator in terms of estimation error and support recovery. We propose a custom algorithm to solve the MIP which is significantly more scalable than off-the-shelf solvers; and demonstrate that our approach can be much more computationally attractive compared to earlier exact MIP-based approaches for the SPCA problem. Our numerical experiments on synthetic and real datasets show that our algorithms can address problems with up to 20000 features in minutes; and generally result in favorable statistical properties compared to existing popular approaches for SPCA.
Stimulated by practical applications arising from viral marketing. This paper investigates a novel Budgeted $k$-Submodular Maximization problem defined as follows: Given a finite set $V$, a budget $B$ and a $k$-submodular function $f: (k+1)^V \mapsto \mathbb{R}_+$, the problem asks to find a solution $\s=(S_1, S_2, \ldots, S_k)$, each element $e \in V$ has a cost $c_i(e)$ to be put into $i$-th set $S_i$, with the total cost of $s$ does not exceed $B$ so that $f(\s)$ is maximized. To address this problem, we propose two streaming algorithms that provide approximation guarantees for the problem. In particular, in the case of each element $e$ has the same cost for all $i$-th sets, we propose a deterministic streaming algorithm which provides an approximation ratio of $\frac{1}{4}-\epsilon$ when $f$ is monotone and $\frac{1}{5}-\epsilon$ when $f$ is non-monotone. For the general case, we propose a random streaming algorithm that provides an approximation ratio of $\min\{\frac{\alpha}{2}, \frac{(1-\alpha)k}{(1+\beta)k-\beta} \}-\epsilon$ when $f$ is monotone and $\min\{\frac{\alpha}{2}, \frac{(1-\alpha)k}{(1+2\beta)k-2\beta} \}-\epsilon$ when $f$ is non-monotone in expectation, where $\beta=\max_{e\in V, i , j \in [k], i\neq j} \frac{c_i(e)}{c_j(e)}$ and $\epsilon, \alpha$ are fixed inputs.
In many practical applications, such as fraud detection, credit risk modeling or medical decision making, classification models for assigning instances to a predefined set of classes are required to be both precise as well as interpretable. Linear modeling methods such as logistic regression are often adopted, since they offer an acceptable balance between precision and interpretability. Linear methods, however, are not well equipped to handle categorical predictors with high-cardinality or to exploit non-linear relations in the data. As a solution, data preprocessing methods such as weight-of-evidence are typically used for transforming the predictors. The binning procedure that underlies the weight-of-evidence approach, however, has been little researched and typically relies on ad-hoc or expert driven procedures. The objective in this paper, therefore, is to propose a formalized, data-driven and powerful method. To this end, we explore the discretization of continuous variables through the binning of spline functions, which allows for capturing non-linear effects in the predictor variables and yields highly interpretable predictors taking only a small number of discrete values. Moreover, we extend upon the weight-of-evidence approach and propose to estimate the proportions using shrinkage estimators. Together, this offers an improved ability to exploit both non-linear and categorical predictors for achieving increased classification precision, while maintaining interpretability of the resulting model and decreasing the risk of overfitting. We present the results of a series of experiments in a fraud detection setting, which illustrate the effectiveness of the presented approach. We facilitate reproduction of the presented results and adoption of the proposed approaches by providing both the dataset and the code for implementing the experiments and the presented approach.
This paper develops a unified and computationally efficient method for change-point inference in non-stationary spatio-temporal processes. By modeling a non-stationary spatio-temporal process as a piecewise stationary spatio-temporal process, we consider simultaneous estimation of the number and locations of change-points, and model parameters in each segment. A composite likelihood-based criterion is developed for change-point and parameters estimation. Under the framework of increasing domain asymptotics, theoretical results including consistency and distribution of the estimators are derived under mild conditions. In contrast to classical results in fixed dimensional time series that the localization error of change-point estimator is $O_{p}(1)$, exact recovery of true change-points can be achieved in the spatio-temporal setting. More surprisingly, the consistency of change-point estimation can be achieved without any penalty term in the criterion function. In addition, we further establish consistency of the number and locations of the change-point estimator under the infill asymptotics framework where the time domain is increasing while the spatial sampling domain is fixed. A computationally efficient pruned dynamic programming algorithm is developed for the challenging criterion optimization problem. Extensive simulation studies and an application to U.S. precipitation data are provided to demonstrate the effectiveness and practicality of the proposed method.
We consider n robots with limited visibility: each robot can observe other robots only up to a constant distance denoted as the viewing range. The robots operate in discrete rounds that are either fully synchronous (FSync) or semi-synchronized (SSync). Most previously studied formation problems in this setting seek to bring the robots closer together (e.g., Gathering or Chain-Formation). In this work, we introduce the Max-Line-Formation problem, which has a contrary goal: to arrange the robots on a straight line of maximal length. First, we prove that the problem is impossible to solve by robots with a constant sized circular viewing range. The impossibility holds under comparably strong assumptions: robots that agree on both axes of their local coordinate systems in FSync. On the positive side, we show that the problem is solvable by robots with a constant square viewing range, i.e., the robots can observe other robots that lie within a constant-sized square centered at their position. In this case, the robots need to agree on only one axis of their local coordinate systems. We derive two algorithms: the first algorithm considers oblivious robots and converges to the optimal configuration in time $\mathcal{O}(n^2 \cdot \log (n/\varepsilon))$ under the SSync scheduler. The other algorithm makes use of locally visible lights (LUMI). It is designed for the FSync scheduler and can solve the problem exactly in optimal time $\Theta(n)$. Afterward, we show that both the algorithmic and the analysis techniques can also be applied to the Gathering and Chain-Formation problem: we introduce an algorithm with a reduced viewing range for Gathering and give new and improved runtime bounds for the Chain-Formation problem.
This paper presents new existence, dual representation and approximation results for the information projection in the infinite-dimensional setting for moment inequality models. These results are established under a general specification of the moment inequality model, nesting both conditional and unconditional models, and allowing for an infinite number of such inequalities. An important innovation of the paper is the exhibition of the dual variable as a weak vector-valued integral to formulate an approximation scheme of the $I$-projection's equivalent Fenchel dual problem. In particular, it is shown under suitable assumptions that the dual problem's optimum value can be approximated by the values of finite-dimensional programs, and that, in addition, every accumulation point of a sequence of optimal solutions for the approximating programs is an optimal solution for the dual problem. This paper illustrates the verification of assumptions and the construction of the approximation scheme's parameters for the cases of unconditional and conditional first-order stochastic dominance constraints.
The availability of large microarray data has led to a growing interest in biclustering methods in the past decade. Several algorithms have been proposed to identify subsets of genes and conditions according to different similarity measures and under varying constraints. In this paper we focus on the exclusive row biclustering problem for gene expression data sets, in which each row can only be a member of a single bicluster while columns can participate in multiple ones. This type of biclustering may be adequate, for example, for clustering groups of cancer patients where each patient (row) is expected to be carrying only a single type of cancer, while each cancer type is associated with multiple (and possibly overlapping) genes (columns). We present a novel method to identify these exclusive row biclusters through a combination of existing biclustering algorithms and combinatorial auction techniques. We devise an approach for tuning the threshold for our algorithm based on comparison to a null model in the spirit of the Gap statistic approach. We demonstrate our approach on both synthetic and real-world gene expression data and show its power in identifying large span non-overlapping rows sub matrices, while considering their unique nature. The Gap statistic approach succeeds in identifying appropriate thresholds in all our examples.
Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space, such as the simplex, the time-discretisation error can dominate when we are near the boundary of the space. We demonstrate that while current SGMCMC methods for the simplex perform well in certain cases, they struggle with sparse simplex spaces; when many of the components are close to zero. However, most popular large-scale applications of Bayesian inference on simplex spaces, such as network or topic models, are sparse. We argue that this poor performance is due to the biases of SGMCMC caused by the discretization error. To get around this, we propose the stochastic CIR process, which removes all discretization error and we prove that samples from the stochastic CIR process are asymptotically unbiased. Use of the stochastic CIR process within a SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.
We consider the task of learning the parameters of a {\em single} component of a mixture model, for the case when we are given {\em side information} about that component, we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than solving the overall original problem, where one learns parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each one of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy, and also improved computation complexity than existing moment based mixture model algorithms (e.g. tensor methods). We also illustrate several natural ways one can obtain such side information, for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms showing significant improvement in runtime and accuracy.