We consider the problem of finding the matching map between two sets of $d$-dimensional noisy feature-vectors. The distinctive feature of our setting is that we do not assume that all the vectors of the first set have a corresponding vector in the second set. If $n$ and $m$ are the sizes of these two sets, we assume that the matching map to be recovered is defined on a subset of unknown cardinality $k^*\le \min(n,m)$. We show that, in the high-dimensional setting, if the signal-to-noise ratio is larger than $5(d\log(4nm/\alpha))^{1/4}$, then the true matching map can be recovered with probability $1-\alpha$. Interestingly, this threshold does not depend on $k^*$ and is the same as the one obtained in prior work for the case $k^* = \min(n,m)$. The procedure for which this property is proved is obtained by a data-driven selection among candidate mappings $\{\hat\pi_k:k\in[\min(n,m)]\}$, where each $\hat\pi_k$ minimizes the sum of squared distances between $k$-element sub-collections of the two sets. The resulting optimization problem can be formulated as a minimum-cost flow problem, and thus solved efficiently. Finally, we report the results of numerical experiments on both synthetic and real-world data that illustrate our theoretical results and provide further insight into the properties of the algorithms studied in this work.
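As a rough illustration of the size-$k$ least-sum-of-squares matching (the paper solves it via minimum-cost flow; the padded-assignment reduction and all names below are illustrative, not the authors' implementation), one can pad the cost matrix with zero-cost dummy columns and call a standard assignment solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_k(X, Y, k):
    """Return k pairs (i, j) minimizing the sum of squared distances
    between X[i] and Y[j], with every point used at most once."""
    n, m = len(X), len(Y)
    assert k <= min(n, m)
    cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # n x m squared distances
    # Pad with n - k zero-cost dummy columns; since costs are nonnegative,
    # the k cheapest real pairs of an optimal padded assignment form an
    # optimal matching of size exactly k.
    padded = np.hstack([cost, np.zeros((n, n - k))])
    rows, cols = linear_sum_assignment(padded)
    real = sorted(((i, j) for i, j in zip(rows, cols) if j < m),
                  key=lambda ij: cost[ij])
    return real[:k]

# Toy example: only the first 3 points of X have true counterparts in Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Y = np.vstack([X[:3] + 0.05 * rng.normal(size=(3, 4)), rng.normal(size=(4, 4))])
print(match_k(X, Y, 3))
```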
In this paper we give an algorithm for streaming $k$-edit approximate pattern matching that uses $\widetilde{O}(k^2)$ space and $\widetilde{O}(k^2)$ time per arriving symbol. This improves substantially on the recent algorithm of Kociumaka, Porat and Starikovskaya (2022), which uses $\widetilde{O}(k^5)$ space and $\widetilde{O}(k^8)$ time per arriving symbol. In the $k$-edit approximate pattern matching problem we are given a pattern $P$ and a text $T$, and we want to identify all substrings of $T$ that are at edit distance at most $k$ from $P$. In the streaming version of this problem both the pattern and the text arrive in a streaming fashion, symbol by symbol, and after each symbol of the text we need to report whether there is a current suffix of the text with edit distance at most $k$ from $P$. We measure the total space used by the algorithm and the time needed per arriving symbol.
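For context, a minimal offline baseline for the reporting task (the classical $O(|P|\cdot|T|)$ dynamic program, not the streaming algorithm of the paper) might look as follows; it stores a full column of size $|P|$ per step, whereas the streaming algorithm keeps only $\widetilde{O}(k^2)$ space in total:

```python
def k_edit_occurrences(P, T, k):
    """Report all positions pos where some suffix of T[:pos+1] is within
    edit distance k of P (classical Sellers-style dynamic program)."""
    m = len(P)
    col = list(range(m + 1))                          # column for the empty text prefix
    hits = []
    for pos, c in enumerate(T):
        prev, col = col, [0] * (m + 1)                # col[0] = 0: matches may start anywhere
        for i in range(1, m + 1):
            col[i] = min(prev[i - 1] + (P[i - 1] != c),   # match / substitute
                         prev[i] + 1,                     # skip the text symbol c
                         col[i - 1] + 1)                  # skip the pattern symbol P[i-1]
        if col[m] <= k:
            hits.append(pos)
    return hits

# Positions ending a substring within one edit of the pattern.
print(k_edit_occurrences("abcd", "xxabxdyyabcdzz", 1))
```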
The maximization of submodular functions has found widespread application in areas such as machine learning, combinatorial optimization, and economics, where practitioners often wish to enforce various constraints; the matroid constraint has been investigated extensively due to its algorithmic properties and expressive power. Recent progress has focused on fast algorithms for important classes of matroids given in explicit form. Currently, nearly-linear time algorithms exist only for graphic and partition matroids [ICALP '19]. In this work, we develop algorithms for monotone submodular maximization constrained by graphic, transversal, or laminar matroids in time near-linear in the size of their representation. Our algorithms achieve an optimal approximation of $1-1/e-\epsilon$ and both generalize and accelerate the results of Ene and Nguyen [ICALP '19]. In fact, the running time of our algorithm cannot be improved within the fast continuous greedy framework of Badanidiyuru and Vondr\'ak [SODA '14]. To achieve near-linear running time, we make use of dynamic data structures that maintain bases with approximate maximum cardinality and weight under certain element updates. These data structures need to support a weight-decrease operation and a novel FREEZE operation that allows the algorithm to freeze elements (i.e., force them to be contained) in its basis regardless of future data structure operations. For the laminar matroid, we present a new dynamic data structure using the top tree interface of Alstrup, Holm, de Lichtenberg, and Thorup [TALG '05] that maintains the maximum-weight basis under insertions and deletions of elements in $O(\log n)$ time. For the transversal matroid, the FREEZE operation corresponds to requiring the data structure to keep a certain set $S$ of vertices matched, a property that we call $S$-stability.
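For intuition, a minimal sketch of the vanilla greedy baseline under a generic matroid independence oracle is shown below; it attains only a $1/2$-approximation and is quadratic in oracle calls, whereas the paper's near-linear algorithms build on the continuous greedy framework with dynamic basis data structures. The toy coverage instance and names are illustrative:

```python
def greedy_matroid(elements, f, is_independent):
    """Classical greedy: f is a monotone submodular set function,
    is_independent is a matroid independence oracle."""
    S, remaining = set(), set(elements)
    while True:
        best, best_gain = None, 0.0
        for e in remaining:
            if is_independent(S | {e}):
                gain = f(S | {e}) - f(S)
                if gain > best_gain:
                    best, best_gain = e, gain
        if best is None:
            return S
        S.add(best)
        remaining.discard(best)

# Toy instance: weighted coverage under a partition matroid (one element per part).
universe_of = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}, "d": {5}}
part_of = {"a": 0, "b": 0, "c": 1, "d": 1}

def coverage(S):
    return len(set().union(*(universe_of[e] for e in S)))

def one_per_part(S):
    parts = [part_of[e] for e in S]
    return len(parts) == len(set(parts))

print(greedy_matroid(universe_of.keys(), coverage, one_per_part))
```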
The Kemeny problem, which consists of computing median consensus rankings of an election with respect to the Kemeny voting rule, has important applications in biology and computational social choice, and was recently generalized via an interesting setwise approach by Gilbert et al. Our first results establish optimal quantitative extensions of the Unanimity property and of the well-known $3/4$-majority rule of Betzler et al. for the classical Kemeny median problem. Moreover, by elaborating an exhaustive list of quantified axiomatic properties (such as the Condorcet and Smith criteria, the $5/6$-majority rule, etc.) of the $3$-wise Kemeny rule, in which not only pairwise comparisons but also the discordance between the winners of subsets of three candidates are taken into account, we conclude that the $3$-wise Kemeny voting scheme induced by the $3$-wise Kendall-tau distance presents interesting advantages in comparison with the classical Kemeny rule. For example, it satisfies several improved manipulation-proofness properties. Since the $3$-wise Kemeny problem is NP-hard, our results also provide some of the first useful space reduction techniques, obtained by determining the relative orders of pairs of alternatives. Our work suggests similar interesting properties of higher-order setwise Kemeny voting schemes, which justify and compensate for their higher computational cost compared with the classical Kemeny scheme.
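For a concrete picture of the classical (pairwise) rule that the setwise approach generalizes, here is a brute-force sketch of the Kemeny median via the Kendall-tau distance on a toy election; the $3$-wise rule would additionally charge a discordance term over triples of candidates, and brute force is of course infeasible for real elections since the problem is NP-hard:

```python
from itertools import permutations, combinations

def kendall_tau(r1, r2):
    """Number of candidate pairs ordered differently by the two rankings."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    return sum((pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
               for a, b in combinations(r1, 2))

def kemeny_median(votes):
    """Ranking minimizing the sum of Kendall-tau distances to all votes."""
    candidates = votes[0]
    return min(permutations(candidates),
               key=lambda r: sum(kendall_tau(r, v) for v in votes))

votes = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c")]
print(kemeny_median(votes))
```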
Orbit recovery problems are a class of problems that frequently arise in practice in various forms. In these problems, we aim to estimate an unknown function that has been distorted by a group action and observed via a known operator. Typically, the observations are contaminated with a non-trivial level of noise. Two particular orbit recovery problems of interest in this paper are multireference alignment and single-particle cryo-EM modeling. In order to suppress the noise, we suggest using the method of moments approach for both problems while introducing deep neural network priors. In particular, our neural networks take the moments as input and output the signals and the distribution of group elements. In the multireference alignment case, we demonstrate the advantage of using the neural network to accelerate the convergence of the reconstruction of signals from the moments. Finally, we use our method to reconstruct simulated and biological volumes in the cryo-EM setting.
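As a small illustration of the method-of-moments viewpoint in the multireference alignment case (a sketch with illustrative parameters; the paper feeds such moments to a neural network that outputs the signal and the distribution of shifts), the empirical mean and the noise-debiased power spectrum are shift-invariant moment estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
L, n, sigma = 16, 5000, 0.5
x = rng.normal(size=L)                                # unknown signal

# Noisy, randomly (cyclically) shifted copies of x.
shifts = rng.integers(0, L, size=n)
obs = np.stack([np.roll(x, s) for s in shifts]) + sigma * rng.normal(size=(n, L))

# First moment: the sample mean is invariant to cyclic shifts.
m1 = obs.mean()
# Second moment: the average power spectrum, debiased by the noise contribution.
m2 = (np.abs(np.fft.fft(obs, axis=1)) ** 2).mean(axis=0) - L * sigma ** 2

print(abs(m1 - x.mean()))                                   # close to 0
print(np.max(np.abs(m2 - np.abs(np.fft.fft(x)) ** 2)))      # close to 0
```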
We study online learning in episodic constrained Markov decision processes (CMDPs), where the goal of the learner is to collect as much reward as possible over the episodes, while guaranteeing that some long-term constraints are satisfied during the learning process. Rewards and constraints can be selected either stochastically or adversarially, and the transition function is not known to the learner. While online learning in classical unconstrained MDPs has received considerable attention in recent years, the setting of CMDPs is still largely unexplored. This is surprising, since in real-world applications, such as autonomous driving, automated bidding, and recommender systems, there are usually additional constraints and specifications that an agent has to obey during the learning process. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with long-term constraints. Our algorithm is capable of handling settings in which rewards and constraints are selected either stochastically or adversarially, without requiring any knowledge of the underlying process. Moreover, our algorithm matches state-of-the-art regret and constraint violation bounds for settings in which constraints are selected stochastically, while it is the first to provide guarantees in the case in which they are chosen adversarially.
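A minimal sketch of the interaction protocol (placeholder names and a uniform-random policy, not the paper's best-of-both-worlds algorithm): in each episode the learner plays a policy in the CMDP, observes rewards and constraint costs chosen by the environment, and is evaluated on cumulative reward and cumulative constraint violation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon, episodes, threshold = 4, 2, 5, 100, 0.3

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # unknown to the learner
total_reward, total_violation = 0.0, 0.0
for ep in range(episodes):
    r = rng.uniform(size=(n_states, n_actions))       # per-episode reward
    c = rng.uniform(size=(n_states, n_actions))       # per-episode constraint cost
    s, ep_reward, ep_cost = 0, 0.0, 0.0
    for h in range(horizon):
        a = rng.integers(n_actions)                   # placeholder policy
        ep_reward += r[s, a]
        ep_cost += c[s, a]
        s = rng.choice(n_states, p=P[s, a])
    total_reward += ep_reward
    total_violation += max(0.0, ep_cost - horizon * threshold)

print(total_reward, total_violation)
```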
Matrix functions play an increasingly important role in many areas of scientific computing and engineering. In such real-world applications, algorithms working in floating-point arithmetic are used for computing matrix functions and, additionally, the input data might be unreliable, e.g., due to measurement errors. Therefore, it is crucial to understand the sensitivity of matrix functions to perturbations, which is measured by condition numbers. However, the condition number itself cannot, in general, be computed exactly either, due to round-off and errors in the input. The sensitivity of the condition number is measured by the so-called level-2 condition number. For the usual (level-1) condition number, it is well known that structured condition numbers (i.e., where only perturbations that preserve the structure of the input matrix are taken into account) might be much smaller than unstructured ones, which, e.g., suggests that structure-preserving algorithms for matrix functions might yield much more accurate results than general-purpose algorithms. In this work, we examine structured level-2 condition numbers in the particular case of restricting the perturbation matrix to an automorphism group, a Lie or Jordan algebra, or the space of quasi-triangular matrices. In numerical experiments, we then compare the unstructured level-2 condition number with the structured one for some specific matrix functions such as the matrix logarithm, the matrix square root, and the matrix exponential.
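For orientation, a Monte Carlo sketch that lower-bounds the usual (level-1) condition number of the matrix logarithm under unstructured versus structure-restricted perturbations is given below (illustrative matrix and parameters; the level-2 condition number studied in the paper instead measures the sensitivity of this condition number itself):

```python
import numpy as np
from scipy.linalg import logm

def cond_lower_bound(A, project, eps=1e-6, trials=200, seed=0):
    """Sampled lower bound on the absolute condition number of logm at A,
    restricted to perturbations in the range of `project`."""
    rng = np.random.default_rng(seed)
    fA, best = logm(A), 0.0
    for _ in range(trials):
        E = project(rng.normal(size=A.shape))
        E *= eps / np.linalg.norm(E, "fro")
        # Finite-difference estimate of ||L_f(A, E)||_F / ||E||_F.
        best = max(best, np.linalg.norm(logm(A + E) - fA, "fro") / eps)
    return best

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.5, 1.0],
              [0.0, 0.0, 1.0]])
unstructured = cond_lower_bound(A, lambda E: E)
structured = cond_lower_bound(A, np.triu)     # perturbations keep A + E triangular
print(unstructured, structured)               # structured <= unstructured (up to sampling)
```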
Nonparametric two-sample testing is a classical problem in inferential statistics. While modern two-sample tests, such as the edge count test and its variants, can handle multivariate and non-Euclidean data, contemporary gargantuan datasets often exhibit heterogeneity due to the presence of latent subpopulations. Direct application of these tests, without accounting for such heterogeneity, may lead to incorrect statistical decisions. We develop a new nonparametric testing procedure that accurately detects differences between the two samples in the presence of unknown heterogeneity in the data generation process. Our framework handles this latent heterogeneity through a composite null that entertains the possibility that the two samples arise from mixtures with identical component distributions but possibly different mixing weights. In this regime, we study the asymptotic behavior of the weighted edge count test statistic and show that it can be effectively re-calibrated to detect arbitrary deviations from the composite null. For practical implementation, we propose a Bootstrapped Weighted Edge Count test with a bootstrap-based calibration procedure that can be easily implemented across a wide range of heterogeneous regimes. A comprehensive simulation study and an application to detecting aberrant user behaviors in online games demonstrate the excellent non-asymptotic performance of the proposed test.
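For context, a minimal sketch of a plain graph-based edge-count statistic with a permutation calibration is shown below (illustrative data and parameters; the paper's test uses a weighted edge-count statistic with a bootstrap calibration tailored to the composite null of equal components with different mixing weights):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph(Z, k=5):
    """Indices of the k nearest neighbours of every pooled point (self excluded)."""
    _, nbrs = cKDTree(Z).query(Z, k=k + 1)
    return nbrs[:, 1:]

def between_sample_edges(nbrs, labels):
    """Number of k-NN edges joining points from different samples."""
    return int((labels[:, None] != labels[nbrs]).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = rng.normal(size=(100, 3)) + 0.8                   # shifted alternative
Z = np.vstack([X, Y])
labels = np.array([0] * len(X) + [1] * len(Y))
nbrs = knn_graph(Z)

observed = between_sample_edges(nbrs, labels)
perms = [between_sample_edges(nbrs, rng.permutation(labels)) for _ in range(500)]
# Fewer between-sample edges than expected under exchangeability suggests a difference.
p_value = (1 + sum(s <= observed for s in perms)) / (1 + len(perms))
print(observed, p_value)
```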
We employ a toolset -- dubbed Dr. Frankenstein -- to analyse the similarity of representations in deep neural networks. With this toolset, we aim to match the activations on given layers of two trained neural networks by joining them with a stitching layer. We demonstrate that the inner representations emerging in deep convolutional neural networks with the same architecture but different initializations can be matched with a surprisingly high degree of accuracy even with a single affine stitching layer. We choose the stitching layer from several possible classes of linear transformations and investigate their performance and properties. The task of matching representations is closely related to notions of similarity. Using this toolset, we also provide a novel viewpoint on the current line of research regarding similarity indices of neural network representations: the perspective of task performance.
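A minimal sketch of model stitching with an affine (1x1 convolution) stitching layer, assuming two small frozen CNNs as placeholders; the Dr. Frankenstein toolset and the actual training setup are not reproduced here:

```python
import torch
import torch.nn as nn

class Stitched(nn.Module):
    def __init__(self, front, back, channels_a, channels_b):
        super().__init__()
        self.front = front            # layers of netA up to the cut (frozen)
        self.back = back              # layers of netB after the cut (frozen)
        # Channel-wise affine map: the only trainable part of the stitched model.
        self.stitch = nn.Conv2d(channels_a, channels_b, kernel_size=1)
        for p in list(front.parameters()) + list(back.parameters()):
            p.requires_grad_(False)

    def forward(self, x):
        return self.back(self.stitch(self.front(x)))

# Two small CNNs with the same architecture but independent initializations.
def make_net():
    return nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

netA, netB = make_net(), make_net()
model = Stitched(netA[:2], netB[2:], channels_a=8, channels_b=8)
print(model(torch.randn(4, 3, 32, 32)).shape)         # torch.Size([4, 10])
```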
In the domain generalization literature, a common objective is to learn representations independent of the domain after conditioning on the class label. We show that this objective is not sufficient: there exist counter-examples where a model fails to generalize to unseen domains even after satisfying class-conditional domain invariance. We formalize this observation through a structural causal model and show the importance of modeling within-class variations for generalization. Specifically, classes contain objects that characterize specific causal features, and domains can be interpreted as interventions on these objects that change non-causal features. We highlight an alternative condition: inputs across domains should have the same representation if they are derived from the same object. Based on this objective, we propose matching-based algorithms when base objects are observed (e.g., through data augmentation) and an approximation of the objective when objects are not observed (MatchDG). Our simple matching-based algorithms are competitive with prior work on out-of-domain accuracy for the rotated MNIST, Fashion-MNIST, PACS, and Chest-Xray datasets. Our method MatchDG also recovers ground-truth object matches: on MNIST and Fashion-MNIST, the top-10 matches from MatchDG have over 50% overlap with ground-truth matches.
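A schematic sketch of the matching objective in the observed-object case (e.g., an image and its augmentation): representations of the same object across domains are pulled together on top of the usual classification loss. The encoder, the weight `lam`, and the data below are illustrative, and MatchDG's iterative contrastive procedure for the unobserved-object case is not reproduced:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    def __init__(self, in_dim=32, feat_dim=16, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, n_classes)

def match_loss(model, x1, x2, y, lam=1.0):
    """x1, x2: the same base objects observed under two domains; y: class labels."""
    z1, z2 = model.features(x1), model.features(x2)
    cls = F.cross_entropy(model.classifier(z1), y) + F.cross_entropy(model.classifier(z2), y)
    match = ((z1 - z2) ** 2).sum(dim=1).mean()        # same object -> same representation
    return cls + lam * match

net = SmallNet()
x1, x2 = torch.randn(8, 32), torch.randn(8, 32)       # e.g. an input and its augmentation
y = torch.randint(0, 5, (8,))
print(match_loss(net, x1, x2, y).item())
```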
Seeking the equivalent entities among multi-source Knowledge Graphs (KGs) is the pivotal step in KG integration, also known as \emph{entity alignment} (EA). However, most existing EA methods are inefficient and scale poorly. A recent summary points out that some of them even require several days to process a dataset containing 200,000 nodes (DWY100K). We believe that over-complex graph encoders and inefficient negative sampling strategies are the two main reasons. In this paper, we propose a novel KG encoder -- the Dual Attention Matching Network (Dual-AMN) -- which not only smartly models both intra-graph and cross-graph information, but also greatly reduces computational complexity. Furthermore, we propose the Normalized Hard Sample Mining Loss to smoothly select hard negative samples with reduced loss shift. The experimental results on widely used public datasets indicate that our method achieves both high accuracy and high efficiency. On DWY100K, the whole running process of our method can be finished in 1,100 seconds, at least 10 times faster than previous work. Our method also outperforms previous works across all datasets, with Hits@1 and MRR improved by 6% to 13%.
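A hedged sketch of smooth hard-negative weighting in an alignment loss (not necessarily the exact form of the paper's Normalized Hard Sample Mining Loss): a LogSumExp over negative similarities acts as a soft maximum, so harder negatives dominate the gradient without an explicit top-k selection; the temperature and margin below are illustrative:

```python
import torch
import torch.nn.functional as F

def smooth_hard_negative_loss(src, tgt, tau=0.1, gamma=1.0):
    """src[i] should align with tgt[i]; all other tgt[j] serve as negatives."""
    src, tgt = F.normalize(src, dim=1), F.normalize(tgt, dim=1)
    sim = src @ tgt.t()                                   # cosine similarities
    pos = sim.diag()
    neg = sim.masked_fill(torch.eye(len(src), dtype=torch.bool), float("-inf"))
    # Soft maximum over negatives: as tau -> 0 this approaches the hardest negative.
    hard_neg = tau * torch.logsumexp(neg / tau, dim=1)
    return F.relu(gamma + hard_neg - pos).mean()

src, tgt = torch.randn(64, 32), torch.randn(64, 32)       # toy entity embeddings
print(smooth_hard_negative_loss(src, tgt).item())
```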