This paper presents GMASK, a general algorithm for distributed approximate similarity search that accepts an arbitrary distance function. GMASK requires a clustering algorithm that induces Voronoi regions in a dataset and returns a representative element for each region. It then creates a multilevel indexing structure suitable for large datasets with high dimensionality and sparsity, which are usually stored in distributed systems. Many similarity search algorithms rely on $k$-means, typically associated with the Euclidean distance, which is inappropriate for certain problems. Instead, in this work we implement GMASK using $k$-medoids to make it compatible with any distance and a wider range of problems. Experimental results verify the applicability of this method on real datasets, improving on the performance of alternative algorithms for approximate similarity search. In addition, the results confirm existing intuitions regarding the advantages of using certain instances of the Minkowski distance in high-dimensional datasets.
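The clustering primitive the abstract describes (Voronoi regions plus one representative per region, under an arbitrary distance) can be illustrated with a bare-bones $k$-medoids sketch. This is a hypothetical minimal implementation, not the GMASK index itself:

```python
import random

def k_medoids(points, k, dist, n_iter=20, seed=0):
    """Bare-bones k-medoids: induces Voronoi regions and returns one
    representative (medoid) per region, for any distance function."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(n_iter):
        # Assignment step: each point joins the region of its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its region.
        new_medoids = []
        for m, members in clusters.items():
            if not members:            # keep the old medoid if its region emptied
                new_medoids.append(m)
                continue
            best = min(members, key=lambda c: sum(dist(points[c], points[j])
                                                  for j in members))
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break                      # converged
        medoids = new_medoids
    return medoids

# Any Minkowski instance works, e.g. the Manhattan distance (p = 1):
manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
```

Because the update step only evaluates `dist`, swapping in another Minkowski instance (or any metric) requires no other change, which is the compatibility the abstract claims for $k$-medoids over $k$-means.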
Efficiently enumerating all the extreme points of a polytope defined by a system of linear inequalities is a well-known challenging problem. We consider a special case and present an algorithm that enumerates all the extreme points of a bisubmodular polyhedron in $\mathcal{O}(n^4|V|)$ time and $\mathcal{O}(n^2)$ space, where $n$ is the dimension of the underlying space and $V$ is the set of outputs. We use reverse search and a signed poset linked to the extreme points to avoid redundant search. Our algorithm is a generalization of enumerating all the extreme points of a base polyhedron, which encompasses several combinatorial enumeration problems.
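For the case this work generalizes, the base polyhedron $B(f)$ of a submodular function $f$, the extreme points are exactly the greedy vectors over orderings of the ground set. A brute-force sketch makes this concrete (exponential in the ground set, which is precisely the redundancy the reverse-search algorithm avoids):

```python
from itertools import permutations

def base_extreme_points(f, ground):
    """Enumerate extreme points of the base polyhedron B(f) of a submodular f
    via the greedy rule over all orderings (brute force, for illustration)."""
    pts = set()
    for perm in permutations(ground):
        x, prefix = {}, []
        for e in perm:
            prev = f(frozenset(prefix))
            prefix.append(e)
            x[e] = f(frozenset(prefix)) - prev   # marginal gain of e
        pts.add(tuple(x[e] for e in ground))
    return pts

# Example: the rank function of the uniform matroid U_{1,2}.
f = lambda S: min(len(S), 1)
```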
We propose a new full discretization of Biot's equations in poroelasticity. The construction is driven by the inf-sup theory we recently developed. It builds upon the four-field formulation of the equations obtained by introducing the total pressure and the total fluid content. We discretize in space with Lagrange finite elements and in time with the backward Euler scheme. We establish inf-sup stability and quasi-optimality of the proposed discretization, with constants that are robust with respect to all material parameters. We further construct an interpolant showing how the error decays for smooth solutions.
We investigate the estimation properties of the mixture of experts (MoE) model in a high-dimensional setting, where the number of predictors is much larger than the sample size, and for which the literature is particularly lacking in theoretical results. We consider the class of softmax-gated Gaussian MoE (SGMoE) models, defined as MoE models with softmax gating functions and Gaussian experts, and focus on the theoretical properties of their $l_1$-regularized estimation via the Lasso. To the best of our knowledge, we are the first to investigate the $l_1$-regularization properties of SGMoE models from a non-asymptotic perspective, under the mildest assumptions, namely the boundedness of the parameter space. We provide a lower bound on the regularization parameter of the Lasso penalty that ensures non-asymptotic theoretical control of the Kullback--Leibler loss of the Lasso estimator for SGMoE models. Finally, we carry out a simulation study to empirically validate our theoretical findings.
We show that the limiting variance of a sequence of estimators for a structured covariance matrix has a general form that appears as the variance of a scaled projection of a random matrix of radial type; a similar result is obtained for the corresponding sequence of estimators for the vector of variance components. These results are illustrated by the limiting behavior of estimators for a linear covariance structure in a variety of multivariate statistical models. We also derive a characterization for the influence function of corresponding functionals. Furthermore, we derive the limiting distribution and influence function of scale invariant mappings of such estimators and their corresponding functionals. As a consequence, the asymptotic relative efficiency of different estimators for the shape component of a structured covariance matrix can be compared by means of a single scalar, and the gross error sensitivity of the corresponding influence functions can be compared by means of a single index. Similar results are obtained for estimators of the normalized vector of variance components. We apply our results to investigate how the efficiency, gross error sensitivity, and breakdown point of S-estimators for the normalized variance components are affected simultaneously by varying their cutoff value.
Column selection is an essential tool for structure-preserving low-rank approximation, with wide-ranging applications across many fields, such as data science, machine learning, and theoretical chemistry. In this work, we develop unified methodologies for fast, efficient, and theoretically guaranteed column selection. First we derive and implement a sparsity-exploiting deterministic algorithm applicable to tasks including kernel approximation and CUR decomposition. Next, we develop a matrix-free formalism relying on a randomization scheme satisfying guaranteed concentration bounds, applying this construction both to CUR decomposition and to the approximation of matrix functions of graph Laplacians. Importantly, the randomization is only relevant for the computation of the scores that we use for column selection, not the selection itself given these scores. For both deterministic and matrix-free algorithms, we bound the performance favorably relative to the expected performance of determinantal point process (DPP) sampling and, in select scenarios, that of exactly optimal subset selection. The general case requires new analysis of the DPP expectation. Finally, we demonstrate strong real-world performance of our algorithms on a diverse set of example approximation tasks.
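As a point of reference for deterministic column selection, a classical greedy baseline (column-pivoted Gram-Schmidt, not the algorithm developed in this work) picks, at each step, the column with the largest residual norm and deflates it out:

```python
import numpy as np

def select_columns(A, k):
    """Greedy column selection via column-pivoted Gram-Schmidt.
    Assumes rank(A) >= k; a classical baseline, not the paper's method."""
    R = A.astype(float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        j = int(np.argmax(norms))          # column with largest residual
        chosen.append(j)
        q = R[:, j] / norms[j]
        R -= np.outer(q, q @ R)            # deflate the chosen direction
    return chosen
```

The selected columns `A[:, chosen]` can then serve as the $C$ factor of a CUR-style decomposition; deflation zeroes out each chosen column, so no column is picked twice.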
The paper analyzes how enlarging the sample affects the mitigation of collinearity, concluding that it may mitigate the statistical consequences of collinearity but not necessarily the numerical instability. The problem addressed is important in the teaching of the social sciences, since it concerns one of the solutions proposed almost unanimously to solve the problem of multicollinearity. For a better understanding and illustration of the paper's contribution, two empirical examples are presented, and no highly technical developments are used.
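The central claim can be reproduced numerically in a few lines: with two nearly collinear regressors, growing the sample leaves the condition number of the design matrix, and hence the numerical instability, essentially unchanged. This is an illustrative sketch, not one of the paper's two empirical examples:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(2000)
x2 = x1 + 0.01 * rng.standard_normal(2000)   # nearly collinear regressor

# Enlarging the sample does not cure the near-linear dependence:
for n in (100, 500, 2000):
    X = np.column_stack([x1[:n], x2[:n]])
    print(n, round(np.linalg.cond(X), 1))    # condition number stays large
```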
We study parametric estimation for a second order linear parabolic stochastic partial differential equation (SPDE) in two space dimensions driven by a $Q$-Wiener process based on high frequency spatio-temporal data. We give an estimator of the damping parameter of the $Q$-Wiener process of the SPDE based on quadratic variations with temporal and spatial increments. We also provide simulation results of the proposed estimator.
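The quadratic-variation idea behind such estimators can be illustrated in the simplest possible setting, a scalar Brownian motion observed at high frequency. This is a toy analogue only, not the SPDE estimator of the paper:

```python
import numpy as np

# For X_t = sigma * W_t observed on a fine grid, the realized quadratic
# variation sum (Delta X)^2 converges to sigma^2 * T as the mesh shrinks.
rng = np.random.default_rng(1)
sigma, T, n = 2.0, 1.0, 100_000
increments = sigma * np.sqrt(T / n) * rng.standard_normal(n)
X = np.cumsum(increments)

qv = np.sum(np.diff(X, prepend=0.0) ** 2)   # realized quadratic variation
sigma_hat = np.sqrt(qv / T)                 # recovers sigma as n grows
```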
Mean value coordinates can be used to map one polygon into another, with application to computer graphics and curve and surface modelling. In this paper we show that if the polygons are quadrilaterals, and if the target quadrilateral is convex, then the mapping is injective.
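Mean value coordinates are computed per vertex from the angles the polygon edges subtend at the evaluation point; the mapping sends each point to the same affine combination of the target polygon's vertices. A sketch for a point strictly inside a polygon with counterclockwise vertex order:

```python
import math

def mean_value_coords(p, poly):
    """Mean value coordinates of p w.r.t. `poly` (CCW vertex list);
    p is assumed to lie strictly inside the polygon."""
    def angle_at_p(a, b):               # signed angle of segment (a, b) seen from p
        ax, ay = a[0] - p[0], a[1] - p[1]
        bx, by = b[0] - p[0], b[1] - p[1]
        return math.atan2(ax * by - ay * bx, ax * bx + ay * by)

    n = len(poly)
    w = []
    for i in range(n):
        prev, cur, nxt = poly[i - 1], poly[i], poly[(i + 1) % n]
        r = math.dist(p, cur)
        w.append((math.tan(angle_at_p(prev, cur) / 2)
                  + math.tan(angle_at_p(cur, nxt) / 2)) / r)
    s = sum(w)
    return [wi / s for wi in w]

def map_point(p, src, dst):
    """Map p from polygon `src` to polygon `dst` via mean value coordinates."""
    lam = mean_value_coords(p, src)
    return (sum(l * v[0] for l, v in zip(lam, dst)),
            sum(l * v[1] for l, v in zip(lam, dst)))
```

The coordinates sum to one and reproduce linear functions, so mapping a quadrilateral onto itself is the identity; the injectivity result concerns what happens when `dst` is a different, convex quadrilateral.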
Hashing has been widely used in approximate nearest neighbor search for large-scale database retrieval because of its computational and storage efficiency. Deep hashing, which devises convolutional neural network architectures to exploit and extract the semantic information or features of images, has received increasing attention recently. In this survey, several deep supervised hashing methods for image retrieval are evaluated, and I identify three main directions for deep supervised hashing methods. Several comments are made at the end. Moreover, to break through the bottleneck of existing hashing methods, I propose a Shadow Recurrent Hashing (SRH) method as an attempt. Specifically, I devise a CNN architecture to extract the semantic features of images and design a loss function that encourages similar images to be projected close together. To this end, I propose a concept: the shadow of the CNN output. During optimization, the CNN output and its shadow guide each other so as to approach the optimal solution as closely as possible. Several experiments on the CIFAR-10 dataset show the satisfying performance of SRH.
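For contrast with the learned methods surveyed, the classical non-learned baseline is random-projection hashing (SimHash): each bit records which side of a random hyperplane a vector falls on, so nearby vectors agree in most bits. This illustrative sketch is unrelated to SRH:

```python
import numpy as np

def hash_codes(X, n_bits=16, seed=0):
    """SimHash-style binary codes: one bit per random hyperplane."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    return (X @ planes > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(a != b))
```

Similar inputs land at small Hamming distance, so retrieval reduces to fast bitwise comparisons over compact codes, which is the efficiency argument the survey opens with.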
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.
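The core building block of a self-attention based model such as Hyper-SAGNN can be sketched in a few lines; scaled dot-product attention operates on any number of node embeddings, which is what makes variable hyperedge sizes unproblematic. This is illustrative only; the actual architecture also combines static and dynamic embeddings with a prediction head:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the rows of X, where each row
    is the embedding of one node in a candidate hyperedge (any row count)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # row-wise softmax
    return w @ V                               # one dynamic embedding per node
```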