Screening methods are useful tools for variable selection in regression analysis when the number of predictors is much larger than the sample size. Factor analysis is used to eliminate multicollinearity among predictors, which improves the variable selection performance. We propose a new method, called Truncated Preconditioned Profiled Independence Screening (TPPIS), that better selects the number of factors to eliminate multicollinearity. The proposed method improves the variable selection performance by truncating unnecessary parts from the information obtained by factor analysis. We confirmed the superior performance of the proposed method in variable selection through analysis using simulation data and real datasets.
We provide a framework for the numerical approximation of distributed optimal control problems, based on least-squares finite element methods. Our proposed method simultaneously solves the state and adjoint equations and is $\inf$--$\sup$ stable for any choice of conforming discretization spaces. A reliable and efficient a posteriori error estimator is derived for problems where box constraints are imposed on the control. It can be localized and therefore used to steer an adaptive algorithm. For unconstrained optimal control problems, i.e., the set of controls being a Hilbert space, we obtain a coercive least-squares method and, in particular, quasi-optimality for any choice of discrete approximation space. For constrained problems we derive and analyze a variational inequality where the PDE part is tackled by least-squares finite element methods. We show that the abstract framework can be applied to a wide range of problems, including scalar second-order PDEs, the Stokes problem, and parabolic problems on space-time domains. Numerical examples for some selected problems are presented.
High-dimensional Partial Differential Equations (PDEs) are a popular mathematical modelling tool, with applications ranging from finance to computational chemistry. However, standard numerical techniques for solving these PDEs are typically affected by the curse of dimensionality. In this work, we tackle this challenge while focusing on stationary diffusion equations defined over a high-dimensional domain with periodic boundary conditions. Inspired by recent progress in sparse function approximation in high dimensions, we propose a new method called compressive Fourier collocation. Combining ideas from compressive sensing and spectral collocation, our method replaces the use of structured collocation grids with Monte Carlo sampling and employs sparse recovery techniques, such as orthogonal matching pursuit and $\ell^1$ minimization, to approximate the Fourier coefficients of the PDE solution. We conduct a rigorous theoretical analysis showing that the approximation error of the proposed method is comparable with the best $s$-term approximation (with respect to the Fourier basis) to the solution. Using the recently introduced framework of random sampling in bounded Riesz systems, our analysis shows that the compressive Fourier collocation method mitigates the curse of dimensionality with respect to the number of collocation points under sufficient conditions on the regularity of the diffusion coefficient. We also present numerical experiments that illustrate the accuracy and stability of the method for the approximation of sparse and compressible solutions.
Given samples from two non-negative random variables, we propose a new class of nonparametric tests for the null hypothesis that one random variable dominates the other with respect to second-order stochastic dominance. These tests are based on the Lorenz P-P plot (LPP), which is the composition between the inverse unscaled Lorenz curve of one distribution and the unscaled Lorenz curve of the other. The LPP exceeds the identity function if and only if the dominance condition is violated, providing a rather simple method to construct test statistics, given by functionals defined over the difference between the identity and the LPP. We determine a stochastic upper bound for such test statistics under the null hypothesis, and derive its limit distribution, to be approximated via bootstrap procedures. We also establish the asymptotic validity of the tests under relatively mild conditions, allowing for both dependent and independent samples. Finally, finite sample properties are investigated through simulation studies.
Learning distance functions between complex objects, such as the Wasserstein distance to compare point sets, is a common goal in machine learning applications. However, functions on such complex objects (e.g., point sets and graphs) are often required to be invariant to a wide variety of group actions e.g. permutation or rigid transformation. Therefore, continuous and symmetric product functions (such as distance functions) on such complex objects must also be invariant to the product of such group actions. We call these functions symmetric and factor-wise group invariant (or SFGI functions in short). In this paper, we first present a general neural network architecture for approximating SFGI functions. The main contribution of this paper combines this general neural network with a sketching idea to develop a specific and efficient neural network which can approximate the $p$-th Wasserstein distance between point sets. Very importantly, the required model complexity is independent of the sizes of input point sets. On the theoretical front, to the best of our knowledge, this is the first result showing that there exists a neural network with the capacity to approximate Wasserstein distance with bounded model complexity. Our work provides an interesting integration of sketching ideas for geometric problems with universal approximation of symmetric functions. On the empirical front, we present a range of results showing that our newly proposed neural network architecture performs comparatively or better than other models (including a SOTA Siamese Autoencoder based approach). In particular, our neural network generalizes significantly better and trains much faster than the SOTA Siamese AE. Finally, this line of investigation could be useful in exploring effective neural network design for solving a broad range of geometric optimization problems (e.g., $k$-means in a metric space).
The proliferation of data generation has spurred advancements in functional data analysis. With the ability to analyze multiple variables simultaneously, the demand for working with multivariate functional data has increased. This study proposes a novel formulation of the epigraph and hypograph indexes, as well as their generalized expressions, specifically tailored for the multivariate functional context. These definitions take into account the interrelations between components. Furthermore, the proposed indexes are employed to cluster multivariate functional data. In the clustering process, the indexes are applied to both the data and their first and second derivatives. This generates a reduced-dimension dataset from the original multivariate functional data, enabling the application of well-established multivariate clustering techniques that have been extensively studied in the literature. This methodology has been tested through simulated and real datasets, performing comparative analyses against state-of-the-art to assess its performance.
We propose a new concept of codivergence, which quantifies the similarity between two probability measures $P_1, P_2$ relative to a reference probability measure $P_0$. In the neighborhood of the reference measure $P_0$, a codivergence behaves like an inner product between the measures $P_1 - P_0$ and $P_2 - P_0$. Codivergences of covariance-type and correlation-type are introduced and studied with a focus on two specific correlation-type codivergences, the $\chi^2$-codivergence and the Hellinger codivergence. We derive explicit expressions for several common parametric families of probability distributions. For a codivergence, we introduce moreover the divergence matrix as an analogue of the Gram matrix. It is shown that the $\chi^2$-divergence matrix satisfies a data-processing inequality.
A new method of representing graph projections in computer memory is proposed, which is more informative than matrix and list data structures based on elementary relations of vertices adjacency or edge incidences. The class of graphs considered in this study is expanded to include mixed graphs containing both undirected and directed edges (arcs). A new method for searching the shortest routes based on this approach is also proposed. The results of the general and special (for compact graphs) analysis of the asymptotic complexity of this method in solving typical SPP problems (Shortest Path Problem) show that the developed method is highly efficient and will find its application not only in information networks, where there are particularly high requirements for the topology of computing systems and the efficiency of finding shortest routes, but also in other scientific, technical, transport and economic fields.
The stochastic partial differential equation (SPDE) approach is widely used for modeling large spatial datasets. It is based on representing a Gaussian random field $u$ on $\mathbb{R}^d$ as the solution of an elliptic SPDE $L^\beta u = \mathcal{W}$ where $L$ is a second-order differential operator, $2\beta$ (belongs to natural number starting from 1) is a positive parameter that controls the smoothness of $u$ and $\mathcal{W}$ is Gaussian white noise. A few approaches have been suggested in the literature to extend the approach to allow for any smoothness parameter satisfying $\beta>d/4$. Even though those approaches work well for simulating SPDEs with general smoothness, they are less suitable for Bayesian inference since they do not provide approximations which are Gaussian Markov random fields (GMRFs) as in the original SPDE approach. We address this issue by proposing a new method based on approximating the covariance operator $L^{-2\beta}$ of the Gaussian field $u$ by a finite element method combined with a rational approximation of the fractional power. This results in a numerically stable GMRF approximation which can be combined with the integrated nested Laplace approximation (INLA) method for fast Bayesian inference. A rigorous convergence analysis of the method is performed and the accuracy of the method is investigated with simulated data. Finally, we illustrate the approach and corresponding implementation in the R package rSPDE via an application to precipitation data which is analyzed by combining the rSPDE package with the R-INLA software for full Bayesian inference.
Conformal prediction is an assumption-lean approach to generating distribution-free prediction intervals or sets, for nearly arbitrary predictive models, with guaranteed finite-sample coverage. Conformal methods are an active research topic in statistics and machine learning, but only recently have they been extended to non-exchangeable data. In this paper, we invite survey methodologists to begin using and contributing to conformal methods. We introduce how conformal prediction can be applied to data from several common complex sample survey designs, under a framework of design-based inference for a finite population, and we point out gaps where survey methodologists could fruitfully apply their expertise. Our simulations empirically bear out the theoretical guarantees of finite-sample coverage, and our real-data example demonstrates how conformal prediction can be applied to complex sample survey data in practice.
Hashing has been widely used in approximate nearest search for large-scale database retrieval for its computation and storage efficiency. Deep hashing, which devises convolutional neural network architecture to exploit and extract the semantic information or feature of images, has received increasing attention recently. In this survey, several deep supervised hashing methods for image retrieval are evaluated and I conclude three main different directions for deep supervised hashing methods. Several comments are made at the end. Moreover, to break through the bottleneck of the existing hashing methods, I propose a Shadow Recurrent Hashing(SRH) method as a try. Specifically, I devise a CNN architecture to extract the semantic features of images and design a loss function to encourage similar images projected close. To this end, I propose a concept: shadow of the CNN output. During optimization process, the CNN output and its shadow are guiding each other so as to achieve the optimal solution as much as possible. Several experiments on dataset CIFAR-10 show the satisfying performance of SRH.