Density-based clustering algorithms are widely used for discovering clusters in pattern recognition and machine learning since they can deal with non-hyperspherical clusters and are robustness to handle outliers. However, the runtime of density-based algorithms are heavily dominated by finding fixed-radius near neighbors and calculating the density, which is time-consuming. Meanwhile, the traditional acceleration methods using indexing technique such as KD tree is not effective in processing high-dimensional data. In this paper, we propose a fast region query algorithm named fast principal component analysis pruning (called FPCAP) with the help of the fast principal component analysis technique in conjunction with geometric information provided by principal attributes of the data, which can process high-dimensional data and be easily applied to density-based methods to prune unnecessary distance calculations when finding neighbors and estimating densities. As an application in density-based clustering methods, FPCAP method was combined with the Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. And then, an improved DBSCAN (called IDBSCAN) is obtained, which preserves the advantage of DBSCAN and meanwhile, greatly reduces the computation of redundant distances. Experiments on seven benchmark datasets demonstrate that the proposed algorithm improves the computational efficiency significantly.
Truncated densities are probability density functions defined on truncated domains. They share the same parametric form with their non-truncated counterparts up to a normalizing constant. Since the computation of their normalizing constants is usually infeasible, Maximum Likelihood Estimation cannot be easily applied to estimate truncated density models. Score Matching (SM) is a powerful tool for fitting parameters using only unnormalized models. However, it cannot be directly applied here as boundary conditions used to derive a tractable SM objective are not satisfied by truncated densities. In this paper, we study parameter estimation for truncated probability densities using SM. The estimator minimizes a weighted Fisher divergence. The weight function is simply the shortest distance from a data point to the boundary of the domain. We show this choice of weight function naturally arises from minimizing the Stein discrepancy as well as upperbounding the finite-sample estimation error. The usefulness of our method is demonstrated by numerical experiments and a study on the Chicago crime data set. We also show that the proposed density estimation can correct the outlier-trimming bias caused by aggressive outlier detection methods.
We introduce a family of pairwise stochastic gradient estimators for gradients of expectations, which are related to the log-derivative trick, but involve pairwise interactions between samples. The simplest example of our new estimator, dubbed the fundamental trick estimator, is shown to arise from either a) introducing and approximating an integral representation based on the fundamental theorem of calculus, or b) applying the reparameterisation trick to an implicit parameterisation under infinitesimal perturbation of the parameters. From the former perspective we generalise to a reproducing kernel Hilbert space representation, giving rise to a locality parameter in the pairwise interactions mentioned above, yielding our representer trick estimator. The resulting estimators are unbiased and shown to offer an independent component of useful information in comparison with the log-derivative estimator. We provide a further novel theoretical analysis which further characterises the variance reduction afforded by the new techniques. Promising analytical and numerical examples confirm the theory and intuitions behind the new estimators.
Covariance estimation for matrix-valued data has received an increasing interest in applications. Unlike previous works that rely heavily on matrix normal distribution assumption and the requirement of fixed matrix size, we propose a class of distribution-free regularized covariance estimation methods for high-dimensional matrix data under a separability condition and a bandable covariance structure. Under these conditions, the original covariance matrix is decomposed into a Kronecker product of two bandable small covariance matrices representing the variability over row and column directions. We formulate a unified framework for estimating bandable covariance, and introduce an efficient algorithm based on rank one unconstrained Kronecker product approximation. The convergence rates of the proposed estimators are established, and the derived minimax lower bound shows our proposed estimator is rate-optimal under certain divergence regimes of matrix size. We further introduce a class of robust covariance estimators and provide theoretical guarantees to deal with heavy-tailed data. We demonstrate the superior finite-sample performance of our methods using simulations and real applications from a gridded temperature anomalies dataset and a S&P 500 stock data analysis.
We investigate the feature compression of high-dimensional ridge regression using the optimal subsampling technique. Specifically, based on the basic framework of random sampling algorithm on feature for ridge regression and the A-optimal design criterion, we first obtain a set of optimal subsampling probabilities. Considering that the obtained probabilities are uneconomical, we then propose the nearly optimal ones. With these probabilities, a two step iterative algorithm is established which has lower computational cost and higher accuracy. We provide theoretical analysis and numerical experiments to support the proposed methods. Numerical results demonstrate the decent performance of our methods.
This paper proposes a numerical method based on the Adomian decomposition approach for the time discretization, applied to Euler equations. A recursive property is demonstrated that allows to formulate the method in an appropriate and efficient way. To obtain a fully numerical scheme, the space discretization is achieved using the classical DG techniques. The efficiency of the obtained numerical scheme is demonstrated through numerical tests by comparison to exact solution and the popular Runge-Kutta DG method results.
The geometric high-order regularization methods such as mean curvature and Gaussian curvature, have been intensively studied during the last decades due to their abilities in preserving geometric properties including image edges, corners, and image contrast. However, the dilemma between restoration quality and computational efficiency is an essential roadblock for high-order methods. In this paper, we propose fast multi-grid algorithms for minimizing both mean curvature and Gaussian curvature energy functionals without sacrificing the accuracy for efficiency. Unlike the existing approaches based on operator splitting and the Augmented Lagrangian method (ALM), no artificial parameters are introduced in our formulation, which guarantees the robustness of the proposed algorithm. Meanwhile, we adopt the domain decomposition method to promote parallel computing and use the fine-to-coarse structure to accelerate the convergence. Numerical experiments are presented on both image denoising and CT reconstruction problem to demonstrate the ability to recover image texture and the efficiency of the proposed method.
Kernel smooth is the most fundamental technique for data density and regression estimation. However, time-consuming is the biggest obstacle for the application that the direct evaluation of kernel smooth for $N$ samples needs ${O}\left( {{N}^{2}} \right)$ operations. People have developed fast smooth algorithms using the idea of binning with FFT. Unfortunately, the accuracy is not controllable, and the implementation for multivariable and its bandwidth selection for the fast method is not available. Hence, we introduce a new MATLAB toolbox for fast multivariate kernel regression with the idea of non-uniform FFT (NUFFT), which implemented the algorithm for $M$ gridding points with ${O}\left( N+M\log M \right)$ complexity and accuracy controllability. The bandwidth selection problem utilizes the Fast Monte-Carlo algorithm to estimate the degree of freedom (DF), saving enormous cross-validation time even better when data share the same grid space for multiple regression. Up to now, this is the first toolbox for fast-binning high-dimensional kernel regression. Moreover, the estimation for local polynomial regression, the conditional variance for the heteroscedastic model, and the complex-valued datasets are also implemented in this toolbox. The performance is demonstrated with simulations and an application on the quantitive EEG.
One of the most important problems in system identification and statistics is how to estimate the unknown parameters of a given model. Optimization methods and specialized procedures, such as Empirical Minimization (EM) can be used in case the likelihood function can be computed. For situations where one can only simulate from a parametric model, but the likelihood is difficult or impossible to evaluate, a technique known as the Two-Stage (TS) Approach can be applied to obtain reliable parametric estimates. Unfortunately, there is currently a lack of theoretical justification for TS. In this paper, we propose a statistical decision-theoretical derivation of TS, which leads to Bayesian and Minimax estimators. We also show how to apply the TS approach on models for independent and identically distributed samples, by computing quantiles of the data as a first step, and using a linear function as the second stage. The proposed method is illustrated via numerical simulations.
We present a new sublinear time algorithm for approximating the spectral density (eigenvalue distribution) of an $n\times n$ normalized graph adjacency or Laplacian matrix. The algorithm recovers the spectrum up to $\epsilon$ accuracy in the Wasserstein-1 distance in $O(n\cdot \text{poly}(1/\epsilon))$ time given sample access to the graph. This result compliments recent work by David Cohen-Steiner, Weihao Kong, Christian Sohler, and Gregory Valiant (2018), which obtains a solution with runtime independent of $n$, but exponential in $1/\epsilon$. We conjecture that the trade-off between dimension dependence and accuracy is inherent. Our method is simple and works well experimentally. It is based on a Chebyshev polynomial moment matching method that employees randomized estimators for the matrix trace. We prove that, for any Hermitian $A$, this moment matching method returns an $\epsilon$ approximation to the spectral density using just $O({1}/{\epsilon})$ matrix-vector products with $A$. By leveraging stability properties of the Chebyshev polynomial three-term recurrence, we then prove that the method is amenable to the use of coarse approximate matrix-vector products. Our sublinear time algorithm follows from combining this result with a novel sampling algorithm for approximating matrix-vector products with a normalized graph adjacency matrix. Of independent interest, we show a similar result for the widely used \emph{kernel polynomial method} (KPM), proving that this practical algorithm nearly matches the theoretical guarantees of our moment matching method. Our analysis uses tools from Jackson's seminal work on approximation with positive polynomial kernels.
Sparse decision tree optimization has been one of the most fundamental problems in AI since its inception and is a challenge at the core of interpretable machine learning. Sparse decision tree optimization is computationally hard, and despite steady effort since the 1960's, breakthroughs have only been made on the problem within the past few years, primarily on the problem of finding optimal sparse decision trees. However, current state-of-the-art algorithms often require impractical amounts of computation time and memory to find optimal or near-optimal trees for some real-world datasets, particularly those having several continuous-valued features. Given that the search spaces of these decision tree optimization problems are massive, can we practically hope to find a sparse decision tree that competes in accuracy with a black box machine learning model? We address this problem via smart guessing strategies that can be applied to any optimal branch-and-bound-based decision tree algorithm. We show that by using these guesses, we can reduce the run time by multiple orders of magnitude, while providing bounds on how far the resulting trees can deviate from the black box's accuracy and expressive power. Our approach enables guesses about how to bin continuous features, the size of the tree, and lower bounds on the error for the optimal decision tree. Our experiments show that in many cases we can rapidly construct sparse decision trees that match the accuracy of black box models. To summarize: when you are having trouble optimizing, just guess.