Two genomes over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Different distances of canonical genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths. Then, the breakpoint distance is equal to n - (c_2 + p_0/2), where n is the number of genes, c_2 is the number of cycles of length 2 and p_0 is the number of paths of length 0. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance is n - (c + p_e/2), where c is the total number of cycles and p_e is the total number of even paths. The distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider the {\sigma}_k distance, defined to be n - [c_2 + c_4 + ... + c_k + (p_0 + p_2 + ... +p_k)/2], and increasingly investigate the complexities of median and double distance for the {\sigma}_4 distance, then the {\sigma}_6 distance, and so on. While for the median much effort was done in our and in other research groups but no progress was obtained even for the {\sigma}_4 distance, for solving the double distance under {\sigma}_4 and {\sigma}_6 distances we could devise linear time algorithms, which we present here.
Motivated by applications to noncoherent network coding, we study subspace codes defined by sets of linear cellular automata (CA). As a first remark, we show that a family of linear CA where the local rules have the same diameter -- and thus the associated polynomials have the same degree -- induces a Grassmannian code. Then, we prove that the minimum distance of such a code is determined by the maximum degree occurring among the pairwise greatest common divisors (GCD) of the polynomials in the family. Finally, we consider the setting where all such polynomials have the same GCD, and determine the cardinality of the corresponding Grassmannian code. As a particular case, we show that if all polynomials in the family are pairwise coprime, the resulting Grassmannian code has the highest minimum distance possible.
We examine the complexity of the ``Texas Hold'em'' variant of poker from a topological perspective. We show that there exists a natural simplicial complex governing the multi-way winning probabilities between various hands, and that this simplicial complex contains $4$-dimensional spheres as induced subcomplexes. We deduce that evaluating the strength of a pair of cards in Texas Hold'em is an intricate problem, and that even the notion of who is bluffing against whom is ill-defined in some situations.
A random algebraic graph is defined by a group ${G}$ with a "uniform" distribution over it and a connection $\sigma:{G}\longrightarrow [0,1]$ with expectation $p,$ satisfying $\sigma({g}) = \sigma({g}^{-1}).$ The random graph $\mathsf{RAG}(n,{G},p,\sigma)$ with vertex set $[n]$ is formed as follows. First, $n$ independent latent vectors ${x}_1, \ldots, {x}_n$ are sampled uniformly from ${G}.$ Then, vertices $i,j$ are connected with probability $\sigma({x}_i{x}_j^{-1}).$ This model captures random geometric graphs with latent space the unit sphere and the hypercube, certain regimes of the stochastic block model, and random subgraphs of Cayley graphs. The main question of interest to the current paper is: when is a random algebraic graph statistically and/or computationally distinguishable from $\mathsf{G}(n,p)$? Our results fall into two main categories. 1) Geometric. We focus on the case ${G} =\{\pm1\}^d$ and use Fourier-analytic tools. For hard threshold connections, we match [LMSY22b] for $p = \omega(1/n)$ and for connections that are $\frac{1}{r\sqrt{d}}$-Lipschitz we extend the results of [LR21b] when $d = \Omega(n\log n)$ to the non-monotone setting. 2) Algebraic. We provide evidence for an exponential statistical-computational gap. Consider any finite group ${G}$ and let $A\subseteq {G}$ be a set of elements formed by including each set of the form $\{{g}, {g}^{-1}\}$ independently with probability $1/2.$ Let $\Gamma_n({G},A)$ be the distribution of random graphs formed by taking a uniformly random induced subgraph of size $n$ of the Cayley graph $\Gamma({G},A).$ Then, $\Gamma_n({G}, A)$ and $\mathsf{G}(n,1/2)$ are statistically indistinguishable with high probability over $A$ if and only if $\log |{G}| \gtrsim n.$ However, low-degree polynomial tests fail to distinguish $\Gamma_n({G}, A)$ and $\mathsf{G}(n,1/2)$ with high probability over $A$ when $\log |{G}| = \log^{\Omega(1)}n.$
We give an approximate Menger-type theorem for when a graph $G$ contains two $X-Y$ paths $P_1$ and $P_2$ such that $P_1 \cup P_2$ is an induced subgraph of $G$. More generally, we prove that there exists a function $f(d) \in O(d)$, such that for every graph $G$ and $X,Y \subseteq V(G)$, either there exist two $X-Y$ paths $P_1$ and $P_2$ such that the distance between $P_1$ and $P_2$ is at least $d$, or there exists $v \in V(G)$ such that the ball of radius $f(d)$ centered at $v$ intersects every $X-Y$ path.
In a recent paper published in the Journal of Language Evolution, Kauhanen, Einhaus & Walkden (//doi.org/10.1093/jole/lzad005, KEW) challenge the results presented in one of my papers (Koplenig, Royal Society Open Science, 6, 181274 (2019), //doi.org/10.1098/rsos.181274), in which I tried to show through a series of statistical analyses that large numbers of L2 (second language) speakers do not seem to affect the (grammatical or statistical) complexity of a language. To this end, I focus on the way in which the Ethnologue assesses language status: a language is characterised as vehicular if, in addition to being used by L1 (first language) speakers, it should also have a significant number of L2 users. KEW criticise both the use of vehicularity as a (binary) indicator of whether a language has a significant number of L2 users and the idea of imputing a zero proportion of L2 speakers to non-vehicular languages whenever a direct estimate of that proportion is unavailable. While I recognise the importance of post-publication commentary on published research, I show in this rejoinder that both points of criticism are explicitly mentioned and analysed in my paper. In addition, I also comment on other points raised by KEW and demonstrate that both alternative analyses offered by KEW do not stand up to closer scrutiny.
Code based Language Models (LMs) have shown very promising results in the field of software engineering with applications such as code refinement, code completion and generation. However, the task of time and space complexity classification from code has not been extensively explored due to a lack of datasets, with prior endeavors being limited to Java. In this project, we aim to address these gaps by creating a labelled dataset of code snippets spanning multiple languages (Python and C++ datasets currently, with C, C#, and JavaScript datasets being released shortly). We find that existing time complexity calculation libraries and tools only apply to a limited number of use-cases. The lack of a well-defined rule based system motivates the application of several recently proposed code-based LMs. We demonstrate the effectiveness of dead code elimination and increasing the maximum sequence length of LMs. In addition to time complexity, we propose to use LMs to find space complexities from code, and to the best of our knowledge, this is the first attempt to do so. Furthermore, we introduce a novel code comprehension task, called cross-language transfer, where we fine-tune the LM on one language and run inference on another. Finally, we visualize the activation of the attention fed classification head of our LMs using Non-negative Matrix Factorization (NMF) to interpret our results.
We investigate random matrices whose entries are obtained by applying a nonlinear kernel function to pairwise inner products between $n$ independent data vectors, drawn uniformly from the unit sphere in $\mathbb{R}^d$. This study is motivated by applications in machine learning and statistics, where these kernel random matrices and their spectral properties play significant roles. We establish the weak limit of the empirical spectral distribution of these matrices in a polynomial scaling regime, where $d, n \to \infty$ such that $n / d^\ell \to \kappa$, for some fixed $\ell \in \mathbb{N}$ and $\kappa \in (0, \infty)$. Our findings generalize an earlier result by Cheng and Singer, who examined the same model in the linear scaling regime (with $\ell = 1$). Our work reveals an equivalence principle: the spectrum of the random kernel matrix is asymptotically equivalent to that of a simpler matrix model, constructed as a linear combination of a (shifted) Wishart matrix and an independent matrix sampled from the Gaussian orthogonal ensemble. The aspect ratio of the Wishart matrix and the coefficients of the linear combination are determined by $\ell$ and the expansion of the kernel function in the orthogonal Hermite polynomial basis. Consequently, the limiting spectrum of the random kernel matrix can be characterized as the free additive convolution between a Marchenko-Pastur law and a semicircle law. We also extend our results to cases with data vectors sampled from isotropic Gaussian distributions instead of spherical distributions.
In this paper we consider the generalized inverse iteration for computing ground states of the Gross-Pitaevskii eigenvector problem (GPE). For that we prove explicit linear convergence rates that depend on the maximum eigenvalue in magnitude of a weighted linear eigenvalue problem. Furthermore, we show that this eigenvalue can be bounded by the first spectral gap of a linearized Gross-Pitaevskii operator, recovering the same rates as for linear eigenvector problems. With this we establish the first local convergence result for the basic inverse iteration for the GPE without damping. We also show how our findings directly generalize to extended inverse iterations, such as the Gradient Flow Discrete Normalized (GFDN) proposed in [W. Bao, Q. Du, SIAM J. Sci. Comput., 25 (2004)] or the damped inverse iteration suggested in [P. Henning, D. Peterseim, SIAM J. Numer. Anal., 53 (2020)]. Our analysis also reveals why the inverse iteration for the GPE does not react favourably to spectral shifts. This empirical observation can now be explained with a blow-up of a weighting function that crucially contributes to the convergence rates. Our findings are illustrated by numerical experiments.
Motivated by an application from geodesy, we introduce a novel clustering problem which is a $k$-center (or k-diameter) problem with a side constraint. For the side constraint, we are given an undirected connectivity graph $G$ on the input points, and a clustering is now only feasible if every cluster induces a connected subgraph in $G$. We call the resulting problems the connected $k$-center problem and the connected $k$-diameter problem. We prove several results on the complexity and approximability of these problems. Our main result is an $O(\log^2{k})$-approximation algorithm for the connected $k$-center and the connected $k$-diameter problem. For Euclidean metrics and metrics with constant doubling dimension, the approximation factor of this algorithm improves to $O(1)$. We also consider the special cases that the connectivity graph is a line or a tree. For the line we give optimal polynomial-time algorithms and for the case that the connectivity graph is a tree, we either give an optimal polynomial-time algorithm or a $2$-approximation algorithm for all variants of our model. We complement our upper bounds by several lower bounds.
Given a matroid $M=(E,{\cal I})$, and a total ordering over the elements $E$, a broken circuit is a circuit where the smallest element is removed and an NBC independent set is an independent set in ${\cal I}$ with no broken circuit. The set of NBC independent sets of any matroid $M$ define a simplicial complex called the broken circuit complex which has been the subject of intense study in combinatorics. Recently, Adiprasito, Huh and Katz showed that the face of numbers of any broken circuit complex form a log-concave sequence, proving a long-standing conjecture of Rota. We study counting and optimization problems on NBC bases of a generic matroid. We find several fundamental differences with the independent set complex: for example, we show that it is NP-hard to find the max-weight NBC base of a matroid or that the convex hull of NBC bases of a matroid has edges of arbitrary large length. We also give evidence that the natural down-up walk on the space of NBC bases of a matroid may not mix rapidly by showing that for some family of matroids it is NP-hard to count the number of NBC bases after certain conditionings.