We present a geometric algorithm to compute the geometric kernel of a generic polyhedron. The geometric kernel (or simply kernel) is defined as the set of points from which the whole polyhedron is visible. Whilst the computation of the kernel of a polygon has been extensively addressed in the literature, far less has been done for polyhedra. Currently, the principal implementation of kernel estimation is based on the solution of a linear programming problem. We compare our method against it on several examples, showing that ours is more efficient in analysing the elements of a generic tessellation. Details of the technical implementation and a discussion of the pros and cons of the method are also provided.
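As a concrete illustration (ours, not the paper's algorithm): the kernel admits a classical characterization as the intersection of the inner half-spaces of the face planes, so membership of a candidate point can be tested directly. A minimal sketch, assuming each face is given by an outward unit normal and one point on its plane:

```python
import numpy as np

def in_kernel(point, face_points, face_normals, tol=1e-9):
    """Test whether `point` lies in the geometric kernel of a polyhedron.

    Uses the classical characterization of the kernel as the intersection
    of the inner half-spaces of the face planes: x is in the kernel iff
    n_f . (x - p_f) <= 0 for every face f, where n_f is the outward unit
    normal of face f and p_f is a point on its plane.
    """
    point = np.asarray(point, dtype=float)
    d = np.einsum('ij,ij->i', face_normals, point - face_points)
    return bool(np.all(d <= tol))

# Example: the kernel of the unit cube is the cube itself, so its centre
# passes the test while an outside point fails it.
normals = np.vstack([np.eye(3), -np.eye(3)])
points = 0.5 * normals                      # one point on each face plane
assert in_kernel([0.0, 0.0, 0.0], points, normals)
assert not in_kernel([1.0, 0.0, 0.0], points, normals)
```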
We study algorithms for approximating the pairwise similarity matrices that arise in natural language processing. In general, computing a similarity matrix for $n$ data points requires $\Omega(n^2)$ similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise, e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular Nystr\"{o}m method to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in time sublinear in the size of the matrix, producing a rank-$s$ approximation with just $O(ns)$ similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We further demonstrate the high accuracy of the approximated similarity matrices on the downstream tasks of document classification, sentence similarity, and cross-document coreference.
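For reference, the classical PSD Nyström recipe that the paper generalizes can be sketched in a few lines; the indefinite-setting variant introduced above modifies this construction and is not reproduced here. A minimal sketch, assuming a callable `sim(i, j)`:

```python
import numpy as np

def nystrom(sim, n, s, rng=None):
    """Rank-s Nystrom approximation of an n x n similarity matrix.

    `sim(i, j)` returns the (possibly expensive) similarity of items i, j.
    Only O(n*s) similarity evaluations are made: the n x s block C and,
    within it, the s x s landmark block W.  The classical PSD estimator is
    A ~= C W^+ C^T; the paper's indefinite generalization modifies this
    recipe and is not reproduced here.
    """
    rng = np.random.default_rng(rng)
    landmarks = rng.choice(n, size=s, replace=False)
    C = np.array([[sim(i, j) for j in landmarks] for i in range(n)])  # n x s
    W = C[landmarks, :]                                               # s x s
    return C @ np.linalg.pinv(W) @ C.T
```

The classical estimator is exact whenever the matrix is PSD and the landmark block captures its full rank; the observation that many NLP similarity matrices are close to PSD is what motivates extending this recipe.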
Additional grid points are often introduced to represent the higher-order polynomial of a numerical solution on curvilinear elements. However, these points may be located slightly outside the domain, even when the vertices of the curvilinear elements lie within the curved domain. This misallocation of grid points produces a mesh error, known as the geometric approximation error. This error is smaller than the discretization error but large enough to significantly degrade long-time integration. Moreover, this mesh error is considered to be the leading cause of conservation error. We propose two novel schemes to mitigate the conservation and/or discretization errors that the geometric approximation error causes in long-time integration: the first scheme retrieves the divergence of the original domain; the second reconstructs the original path of differentiation, called the connection, thus retrieving the original connection. The improved accuracy of the proposed schemes is demonstrated through the conservation error for various partial differential equations solved with moving frames on the sphere.
We consider the problem of selecting confounders for adjustment from a potentially large set of covariates when estimating a causal effect. Recently, the high-dimensional propensity score (hdPS) method was developed for this task; hdPS ranks potential confounders by estimating an importance score for each variable and selects the top few. However, this ranking procedure is limited: it requires all variables to be binary. We propose an extension of hdPS to general types of response and confounder variables. We further develop a group importance score, allowing us to rank groups of potential confounders. The main challenge is that our parameter requires fitting either the propensity score or the response model, both of which are vulnerable to model misspecification. We propose a targeted maximum likelihood estimator (TMLE) that allows the use of nonparametric, machine-learning tools for fitting these intermediate models. We establish asymptotic normality of our estimator, which in turn allows the construction of confidence intervals. We complement our work with numerical studies on simulated and real data.
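To make the ranking step concrete, the sketch below scores one binary candidate confounder with a Bross-style bias multiplier of the kind the original hdPS uses for prioritization. The formula as written here is our recollection of that classical score, offered purely as an illustration of "importance score" ranking, not the paper's extended estimator:

```python
import numpy as np

def bross_bias_score(c, a, y):
    """Score a binary candidate confounder c (with binary treatment a and
    binary outcome y) by a Bross-style bias multiplier:
        bias = (p1 * (rr - 1) + 1) / (p0 * (rr - 1) + 1),
    where p1, p0 are the prevalences of c among treated/untreated units and
    rr is the relative risk of the outcome given c.  Variables are then
    ranked by |log(bias)|.  Sketch only; edge cases are not handled.
    """
    c, a, y = (np.asarray(v, dtype=float) for v in (c, a, y))
    p1 = c[a == 1].mean()
    p0 = c[a == 0].mean()
    rr = y[c == 1].mean() / max(y[c == 0].mean(), 1e-12)
    return abs(np.log((p1 * (rr - 1) + 1) / (p0 * (rr - 1) + 1)))
```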
During the training of networks for distance metric learning, minimizers of the typical loss functions can be viewed as "feasible points" satisfying a set of constraints imposed by the training data. Accordingly, we reformulate the distance metric learning problem as finding a feasible point of a constraint set in which the embedding vectors of the training data satisfy the desired intra-class and inter-class proximities. The feasible set induced by these constraints is expressed as the intersection of relaxed feasible sets, each enforcing the proximity constraints only for particular samples (one sample from each class) of the training data. The feasible point problem is then approximately solved by alternating projections onto these sets. This approach introduces a regularization term and amounts to minimizing a typical loss function with a systematic batch construction, where the batches are constrained to contain the same sample from each class for a certain number of iterations. Moreover, these particular samples can be regarded as class representatives, allowing efficient hard class mining during batch construction. The proposed technique is applied with well-accepted losses and evaluated on the Stanford Online Products, CAR196 and CUB200-2011 datasets for image retrieval and clustering. Outperforming the state of the art, the proposed approach consistently improves the performance of the integrated loss functions with no additional computational cost, and boosts performance further through hard negative class mining.
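A generic sketch of the alternating-projections (POCS) loop described above. The projection operators for the actual proximity constraints are problem-specific, so the example uses two simple convex sets purely for illustration:

```python
import numpy as np

def pocs(x0, projections, iters=100):
    """Approximate a point in the intersection of constraint sets by
    cyclically applying each set's projection operator (projections onto
    convex sets).  Each entry of `projections` maps a point to its nearest
    point in one relaxed feasible set."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        for proj in projections:
            x = proj(x)
    return x

# Example: intersection of the unit ball and the half-space {x : x[0] >= 0.5}.
proj_ball = lambda x: x / max(1.0, np.linalg.norm(x))
def proj_halfspace(x):
    y = x.copy()
    y[0] = max(y[0], 0.5)
    return y

x = pocs(np.array([-2.0, 2.0]), [proj_ball, proj_halfspace])
assert np.linalg.norm(x) <= 1.0 + 1e-9 and x[0] >= 0.5 - 1e-9
```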
Recently, the second and third authors showed that complete geometric graphs on $2n$ vertices in general cannot be partitioned into $n$ plane spanning trees. Building on this work, in this paper we initiate the study of partitioning into beyond-planar subgraphs, namely into $k$-planar and $k$-quasi-planar subgraphs, and obtain the first bounds on the number of subgraphs required in this setting.
We study a sound verification method for parametric component-based systems. The method uses a resource logic, a new formal specification language for distributed systems consisting of a finite yet unbounded number of components. The logic allows the description of architecture configurations that coordinate instances of a finite number of component types, by means of inductive definitions similar to those used to describe algebraic data types or recursive data structures. For parametric systems specified in this logic, we show that decision problems such as reaching a deadlock or violating a critical section are, in general, undecidable. Despite this negative result, we provide practical semi-algorithms for these decision problems, relying on the automatic synthesis of structural invariants that enable the proof of general safety properties. The invariants are defined in the WSkS fragment of monadic second-order logic, which is known to be decidable via a classical automata-logic connection, thus reducing a verification problem to checking the satisfiability of a WSkS formula.
The canonical polyadic decomposition (CPD) of a low-rank tensor plays a major role in data analysis and signal processing by allowing for the unique recovery of underlying factors. However, it is well known that the low-rank CPD approximation problem is ill-posed: a tensor may fail to have a best rank-$R$ CPD approximation when $R>1$. This article gives deterministic bounds for the existence of best low-rank tensor approximations over $\mathbb{K}=\mathbb{R}$ or $\mathbb{K}=\mathbb{C}$. More precisely, given a tensor $\mathcal{T} \in \mathbb{K}^{I \times I \times I}$ of rank $R \leq I$, we compute the radius of a Frobenius-norm ball centered at $\mathcal{T}$ in which best $\mathbb{K}$-rank-$R$ approximations are guaranteed to exist. In addition, we show that every $\mathbb{K}$-rank-$R$ tensor inside this ball has a unique canonical polyadic decomposition. This neighborhood may be interpreted as a neighborhood of ``mathematical truth'' in which CPD approximation and computation are well-posed. In pursuit of these bounds, we describe low-rank tensor decomposition as a ``joint generalized eigenvalue'' problem. Using this framework, we show that, under mild assumptions, a low-rank tensor whose rank is strictly greater than its border rank is defective in the sense of algebraic and geometric multiplicities for joint generalized eigenvalues. Bounds for the existence of best low-rank approximations are then obtained by establishing perturbation-theoretic results for the joint generalized eigenvalue problem. In this way we establish a connection between the existence of best low-rank approximations and the tensor spectral norm. In addition, we solve a ``tensor Procrustes problem'' concerning orthogonal compressions of pairs of tensors. The main results of the article are illustrated by a variety of numerical experiments.
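For concreteness, the rank-$R$ CPD referred to above writes a third-order tensor as a sum of $R$ rank-one terms,
$$\mathcal{T} \;=\; \sum_{r=1}^{R} a_r \otimes b_r \otimes c_r, \qquad \mathcal{T}_{ijk} \;=\; \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}, \qquad a_r, b_r, c_r \in \mathbb{K}^{I},$$
and the border rank is the smallest $R$ for which $\mathcal{T}$ is a limit of tensors of rank at most $R$; ill-posedness arises exactly when the rank exceeds the border rank, since then no nearest rank-$R$ tensor exists.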
We propose a fast and efficient strategy, called the representative approach, for big data analysis with generalized linear models, especially for distributed data with localization requirements or limited network bandwidth. Given a partition of the massive dataset, this approach constructs a representative data point for each data block and fits the target model on the representative dataset. In terms of time complexity, it is as fast as the subsampling approaches in the literature. As for efficiency, its accuracy in estimating parameters under a homogeneous partition is comparable with that of the divide-and-conquer method. Supported by comprehensive simulation studies and theoretical justifications, we conclude that mean representatives (MR) work well for linear models and for generalized linear models with a flat inverse link function and moderate coefficients for the continuous predictors. For general cases, we recommend the proposed score-matching representatives (SMR), which may significantly improve the accuracy of the estimators by matching score function values. As an illustrative application to the Airline on-time performance data, we show that the MR and SMR estimates are as good as the full-data estimate when the latter is available.
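A minimal sketch of the MR idea for the linear-model case (our reading of it; the function name and the plain weighted least-squares fit are illustrative, not the authors' implementation):

```python
import numpy as np

def fit_mr_linear(blocks):
    """Fit a linear model (no intercept, for brevity) on mean
    representatives (MR): each block (X_b, y_b) is collapsed to its
    column means, weighted by block size, and the model is fit by
    weighted least squares on the resulting representative dataset."""
    X = np.array([Xb.mean(axis=0) for Xb, _ in blocks])   # one row per block
    y = np.array([yb.mean() for _, yb in blocks])
    w = np.sqrt(np.array([len(yb) for _, yb in blocks], dtype=float))
    # weighted least squares via row scaling by sqrt(block size)
    beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta
```

Only one representative row per block ever leaves its machine, which is what makes the approach attractive under localization or bandwidth constraints.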
The problem of Approximate Nearest Neighbor (ANN) search is fundamental in computer science and has benefited from significant progress in the past couple of decades. However, most work has been devoted to point sets, whereas complex shapes have not been sufficiently treated. Here, we focus on distance functions between discretized curves in Euclidean space: they appear in a wide range of applications, from road segments to time series in general dimension. For $\ell_p$-products of Euclidean metrics, for any $p$, we design simple and efficient data structures for ANN based on randomized projections; these are of independent interest. They serve to solve proximity problems under a notion of distance between discretized curves that generalizes both the discrete Fr\'echet and Dynamic Time Warping distances, the two most popular and practical approaches to comparing such curves. We offer the first data structures and query algorithms for ANN with an arbitrarily good approximation factor, at the expense of increased space usage and preprocessing time over existing methods. The query time of our algorithms is comparable to, or significantly better than, that of existing methods, and our algorithm is especially efficient when the length of the curves is bounded.
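As a toy illustration of the randomized-projection ingredient only (not the paper's data structures for Fréchet/DTW-type distances): curves with equal vertex counts can be flattened, sketched with a Gaussian Johnson-Lindenstrauss map, and queried by exact search in the low-dimensional sketch space.

```python
import numpy as np

class ProjectedANN:
    """Toy index: flatten each curve's vertex coordinates, sketch them
    with a Gaussian JL map, and answer queries by exact nearest-neighbour
    search in the k-dimensional sketch space.  Assumes all curves have
    the same number of vertices; the paper's structures handle curve
    distances and approximation guarantees far more carefully."""

    def __init__(self, curves, k=32, rng=None):
        rng = np.random.default_rng(rng)
        X = np.array([np.asarray(c).ravel() for c in curves])  # one row per curve
        self.P = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
        self.sketch = X @ self.P

    def query(self, curve):
        q = np.asarray(curve).ravel() @ self.P
        return int(np.argmin(np.linalg.norm(self.sketch - q, axis=1)))
```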
In this paper, from a theoretical perspective, we study how powerful graph neural networks (GNNs) can be for learning approximation algorithms for combinatorial problems. To this end, we first establish a new class of GNNs that can solve a strictly wider variety of problems than existing GNNs. Then, we bridge the gap between GNN theory and the theory of distributed local algorithms to demonstrate that the most powerful GNN can learn approximation algorithms for the minimum dominating set problem and the minimum vertex cover problem with certain approximation ratios, and that no GNN can achieve better ratios. This paper is the first to elucidate the approximation ratios of GNNs for combinatorial problems. Furthermore, we prove that adding coloring or weak-coloring to each node's features improves these approximation ratios. This indicates that preprocessing and feature engineering theoretically strengthen model capabilities.
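The coloring-augmentation step can be sketched as simple preprocessing that appends a one-hot coloring to each node's feature vector; the greedy coloring used below (via networkx, our choice) is just one convenient instantiation, not the specific coloring the paper analyzes.

```python
import networkx as nx
import numpy as np

def with_coloring_features(G, X):
    """Append a one-hot graph coloring to the node feature matrix X
    (shape n x d, nodes labelled 0..n-1).  Greedy coloring is one
    convenient, not necessarily minimal, choice of coloring."""
    coloring = nx.coloring.greedy_color(G, strategy="largest_first")
    k = max(coloring.values()) + 1
    C = np.zeros((G.number_of_nodes(), k))
    for node, color in coloring.items():
        C[node, color] = 1.0
    return np.hstack([X, C])
```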