In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $P$ of length $m$, and a text $T$ of length $n \ge m$, the task is to find substrings of $T$ that are within edit distance $k$ from $P$. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$ amortised time per character of the text, providing answers correct with high probability. (Hereafter, $\tilde{O}(\cdot)$ hides a $\mathrm{poly}(\log n)$ factor.) This answers a decade-old question: since the discovery of a $\mathrm{poly}(k\log n)$-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Prior to this work, no $\mathrm{poly}(k\log n)$-space algorithm was known even in the simpler semi-streaming model, where $T$ comes as a stream but $P$ is available for read-only access. In this model, we give a deterministic algorithm that achieves slightly better complexity. In order to develop the fully streaming algorithm, we introduce a new edit distance sketch parametrised by integers $n\ge k$. For any string of length at most $n$, the sketch is of size $\tilde{O}(k^2)$ and it can be computed with an $\tilde{O}(k^2)$-space streaming algorithm. Given the sketches of two strings, in $\tilde{O}(k^3)$ time we can compute their edit distance or certify that it is larger than $k$. This result improves upon $\tilde{O}(k^8)$-size sketches of Belazzougui and Zhu [FOCS 2016] and very recent $\tilde{O}(k^3)$-size sketches of Jin, Nelson, and Wu [STACS 2021].
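For readers who want a concrete baseline for the bounded edit-distance computation that the sketches compress, the classical banded dynamic program decides whether two strings are within edit distance $k$ in $O((n+m)k)$ time; this is a standard textbook routine, not the paper's algorithm (the function name and interface are our own):

```python
def bounded_edit_distance(a, b, k):
    """Return the edit distance of a and b if it is <= k, else None.

    Standard banded dynamic programming: any alignment of cost <= k stays
    within the diagonal band |i - j| <= k, so only O((|a|+|b|) * k) cells
    are needed.  Values are capped at k + 1.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    INF = k + 1
    # Row 0: distance from empty prefix of a is j deletions, within the band.
    prev = {j: j for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        lo, hi = max(0, i - k), min(m, i + k)
        for j in range(lo, hi + 1):
            if j == 0:
                best = i
            else:
                sub = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])
                ins = cur.get(j - 1, INF) + 1
                dele = prev.get(j, INF) + 1
                best = min(sub, ins, dele)
            cur[j] = min(best, INF)
        prev = cur
    d = prev.get(m, INF)
    return d if d <= k else None
```

Both the paper's semi-streaming and fully streaming algorithms can be viewed as ways to avoid storing even this banded table for long texts.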
We consider the stability of matchings when individuals strategically submit preference information to a publicly known algorithm. Most pure Nash equilibria of the ensuing game yield a matching that is unstable with respect to the individuals' sincere preferences. We introduce a well-supported minimal-dishonesty constraint and obtain conditions under which every pure Nash equilibrium yields a matching that is stable with respect to the sincere preferences. The conditions on the matching algorithm are that it be either fully randomized, or monotonic and independent of non-spouses (INS), an IIA-like property. These conditions are significant because they support the use of algorithms other than the Gale-Shapley (man-optimal) algorithm for kidney exchange and other applications. We prove that the Gale-Shapley algorithm always yields the woman-optimal matching when individuals are minimally dishonest. However, we give a negative answer to one of Gusfield and Irving's open questions: no monotonic INS or fully randomized stable matching algorithm is certain to yield the egalitarian-optimal matching when individuals are strategic and minimally dishonest. Finally, we show that these results extend to the student placement problem, where women are polyandrous but must be honest, but do not extend to the admissions problem, where women are both polyandrous and strategic.
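As a point of reference for the discussion above, the man-proposing Gale-Shapley algorithm can be sketched as follows (a standard textbook implementation; the dictionary-based interface is our own, not code from the paper):

```python
def gale_shapley(men_prefs, women_prefs):
    """Man-proposing Gale-Shapley: returns the man-optimal stable matching.

    Each preference dict maps a person to a list of the opposite side,
    ordered from most to least preferred.  Free men propose down their
    lists; each woman holds her best proposal so far.
    """
    # Precompute each woman's ranking of the men for O(1) comparisons.
    rank = {w: {m: i for i, m in enumerate(p)} for w, p in women_prefs.items()}
    next_proposal = {m: 0 for m in men_prefs}   # index of next woman to try
    engaged_to = {}                             # woman -> current fiance
    free = list(men_prefs)
    while free:
        m = free.pop()
        w = men_prefs[m][next_proposal[m]]
        next_proposal[m] += 1
        if w not in engaged_to:
            engaged_to[w] = m
        elif rank[w][m] < rank[w][engaged_to[w]]:
            free.append(engaged_to[w])          # w trades up, jilting her fiance
            engaged_to[w] = m
        else:
            free.append(m)                      # w rejects m
    return {m: w for w, m in engaged_to.items()}
```

The paper's point is precisely that the matching this algorithm returns for the *submitted* preferences need not be stable for the *sincere* ones.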
Despite huge successes reported by the field of machine learning, such as speech assistants or self-driving cars, businesses still observe a very high failure rate when it comes to deployment of ML in production. We argue that part of the reason is infrastructure that was not designed for activities around data collection and analysis. We propose to consider flow-based programming (FBP) with data streams as an alternative to commonly used service-oriented architectures (SOA) for building software applications. To compare flow-based programming with the widespread service-oriented approach, we develop a data processing application and formulate two subsequent ML-related tasks that constitute a complete cycle of ML deployment while allowing us to assess characteristics of each programming paradigm in the ML context. Employing both code metrics and empirical observations, we show that when it comes to ML deployment each paradigm has certain advantages and drawbacks. Our main conclusion is that while FBP shows great potential for providing infrastructural benefits for the deployment of machine learning, it requires a lot of boilerplate code to define and manipulate the dataflow graph. We believe that with better developer tools in place this problem can be alleviated, establishing FBP as a strong alternative to the currently prevalent SOA-driven approach to software design. Additionally, we provide an insight into the trend of prioritising model development over data quality management.
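To illustrate the flow-based style discussed above, in which components exchange information packets along the edges of a dataflow graph, here is a deliberately minimal toy sketch (the `Component` and `connect` names are hypothetical; real FBP runtimes add named ports, schedulers, and back-pressure):

```python
from queue import Queue

class Component:
    """One node of a toy flow-based-programming graph: it consumes
    information packets from an input port and emits transformed packets
    on an output port.  A minimal sketch of the paradigm, not the
    authors' application code."""

    def __init__(self, fn):
        self.fn = fn
        self.inbox, self.outbox = Queue(), Queue()

    def run(self):
        # Process every packet currently queued on the input port.
        while not self.inbox.empty():
            self.outbox.put(self.fn(self.inbox.get()))

def connect(upstream, downstream):
    """Wire an edge of the dataflow graph: downstream reads what
    upstream writes."""
    downstream.inbox = upstream.outbox

# A two-stage graph: parse raw strings, then scale the numbers.
parse = Component(float)
scale = Component(lambda x: 2 * x)
connect(parse, scale)
for raw in ["1", "2.5"]:
    parse.inbox.put(raw)
parse.run()
scale.run()
```

Even this toy version shows the boilerplate the abstract mentions: every edge and every component must be declared explicitly before any data flows.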
The problem considered is the non-preemptive scheduling of independent jobs that consume a non-renewable, regularly replenished resource on parallel uniformly related machines. The input specifies the speeds of the machines, the sizes of the jobs, the quantity of resource required by each job, and the replenished quantities and replenishment dates of the resource. A job can start processing only after the required quantity of the resource has been allocated to it. The objective function is the minimization of a convex combination of the makespan and an objective equivalent to the $l_p$-norm of the vector of machine loads. We present an EPTAS for this problem. Prior to our work, only a PTAS was known in this non-renewable resource setting, and only for the special case of makespan minimization on identical machines.
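For concreteness, the objective function described above can be written down directly; the following helper is only an illustration of the definition (the parameter names `p` and `lam` are ours), not part of the EPTAS:

```python
def objective(loads, p, lam):
    """Scheduling objective from the abstract: the convex combination
    lam * makespan + (1 - lam) * (l_p-norm of the machine-load vector),
    where 0 <= lam <= 1.  Illustrative only; p and lam are instance data.
    """
    makespan = max(loads)
    lp_norm = sum(x ** p for x in loads) ** (1.0 / p)
    return lam * makespan + (1 - lam) * lp_norm
```

Setting `lam = 1` recovers pure makespan minimization, the special case for which the earlier PTAS was known.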
Graph matching, also known as network alignment, refers to finding a bijection between the vertex sets of two given graphs so as to maximally align their edges. This fundamental computational problem arises frequently in multiple fields such as computer vision and biology. Recently, there has been a plethora of work studying efficient algorithms for graph matching under probabilistic models. In this work, we propose a new algorithm for graph matching: Our algorithm associates each vertex with a signature vector using a multistage procedure and then matches a pair of vertices from the two graphs if their signature vectors are close to each other. We show that, for two Erd\H{o}s--R\'enyi graphs with edge correlation $1-\alpha$, our algorithm recovers the underlying matching exactly with high probability when $\alpha \le 1 / (\log \log n)^C$, where $n$ is the number of vertices in each graph and $C$ denotes a positive universal constant. This improves the condition $\alpha \le 1 / (\log n)^C$ achieved in previous work.
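The matching step described above, pairing vertices whose signature vectors are close, can be illustrated with a toy greedy version (the multistage construction of the signatures is the technical core of the paper and is not reproduced here; this helper and its interface are our own):

```python
import numpy as np

def match_by_signatures(sig1, sig2):
    """Match vertices of two graphs given per-vertex signature vectors:
    pair vertex i of graph 1 with the vertex j of graph 2 whose signature
    is closest in Euclidean distance.  A greedy toy version of the final
    matching step only; how to build signatures that make this succeed is
    the hard part.
    """
    return {i: int(np.argmin(np.linalg.norm(sig2 - s, axis=1)))
            for i, s in enumerate(sig1)}
```

When the signatures of truly corresponding vertices concentrate near each other while non-corresponding pairs stay separated, this nearest-signature rule recovers the planted matching.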
We study the complexity of the decision problem known as Permutation Pattern Matching, or PPM. The input of PPM consists of a pair of permutations $\tau$ (the `text') and $\pi$ (the `pattern'), and the goal is to decide whether $\tau$ contains $\pi$ as a subpermutation. On general inputs, PPM is known to be NP-complete by a result of Bose, Buss and Lubiw. In this paper, we focus on restricted instances of PPM where the text is assumed to avoid a fixed (small) pattern $\sigma$; this restriction is known as Av($\sigma$)-PPM. It has been previously shown that Av($\sigma$)-PPM is polynomial for any $\sigma$ of size at most 3, while it is NP-hard for any $\sigma$ containing a monotone subsequence of length four. In this paper, we present a new hardness reduction which allows us to show, in a uniform way, that Av($\sigma$)-PPM is hard for every $\sigma$ of size at least 6, for every $\sigma$ of size 5 except the symmetry class of $41352$, as well as for every $\sigma$ symmetric to one of the three permutations $4321$, $4312$ and $4231$. Moreover, assuming the exponential time hypothesis, none of these hard cases of Av($\sigma$)-PPM can be solved in time $2^{o(n/\log n)}$. Previously, such a conditional lower bound was not known even for the unconstrained PPM problem. On the tractability side, we combine the CSP approach of Guillemot and Marx with the structural results of Huczynska and Vatter to show that for any monotone-griddable permutation class $C$, PPM is polynomial when the text is restricted to a permutation from $C$.
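To fix the notion of containment used above: $\tau$ contains $\pi$ as a subpermutation if some subsequence of $\tau$ is order-isomorphic to $\pi$. The naive exponential-time check below is shown only to make the definition concrete; the paper's contribution concerns far more refined algorithms and hardness reductions:

```python
from itertools import combinations

def contains_pattern(text, pattern):
    """Decide whether the permutation `text` contains `pattern` as a
    subpermutation, by brute force over all index subsets.

    Two sequences of distinct values are order-isomorphic exactly when
    their argsorts (positions listed in increasing order of value) agree.
    """
    k = len(pattern)
    order = sorted(range(k), key=lambda i: pattern[i])
    for idx in combinations(range(len(text)), k):
        vals = [text[i] for i in idx]
        if sorted(range(k), key=lambda i: vals[i]) == order:
            return True
    return False
```

For example, $53412$ contains $21$ (take the subsequence $5,3$), while the increasing permutation $1234$ does not.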
In this paper, we investigate fast algorithms to approximate the Caputo derivative $^C_0D_t^\alpha u(t)$ when $\alpha$ is small. We focus on two fast algorithms, namely FIR and FIDR, both of which rely on a sum-of-exponentials approximation to reduce the cost of evaluating the history part. FIR is the numerical scheme originally proposed in [16], and FIDR is an alternative scheme we propose in this work; the latter shows superior performance when $\alpha$ is small. With quantitative estimates, we prove that given a certain error threshold, the computational cost of evaluating the history part of the Caputo derivative can be decreased as $\alpha$ gets small. Hence, only minimal cost for the fast evaluation is required in the small-$\alpha$ regime, which matches prevailing protocols in engineering practice. We also present a stability and error analysis of FIDR for solving linear fractional diffusion equations. Finally, we carry out systematic numerical studies of the performance of both the FIR and FIDR schemes, where we explore the trade-off between accuracy and efficiency when $\alpha$ is small.
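For orientation, the Caputo derivative in question has the standard definition (stated here for $0<\alpha<1$; this display follows the usual convention and is supplied for the reader's convenience, not quoted from the paper):

\[
{}^C_0D_t^\alpha u(t) \;=\; \frac{1}{\Gamma(1-\alpha)} \int_0^t \frac{u'(s)}{(t-s)^{\alpha}}\,\mathrm{d}s .
\]

Fast schemes of the FIR/FIDR type split this integral at $t-\Delta t$ into a local part, treated directly, and a history part over $[0, t-\Delta t]$, whose power-law kernel is compressed by a sum-of-exponentials approximation of the form $\sum_{i=1}^{N_{\mathrm{exp}}} \omega_i e^{-s_i(t-s)}$, valid for $t-s \ge \Delta t$. Because each exponential mode can be updated recursively, the history part then costs only $O(N_{\mathrm{exp}})$ work and storage per time step instead of a sum over the entire past.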
Restricted star colouring is a variant of star colouring introduced to design heuristic algorithms for estimating sparse Hessian matrices. For $k\in\mathbb{N}$, a $k$-restricted star colouring ($k$-rs colouring) of a graph $G$ is a function $f:V(G)\to\{0,1,\dots,k-1\}$ such that (i) $f(x)\neq f(y)$ for every edge $xy$ of $G$, and (ii) there is no bicoloured 3-vertex path ($P_3$) in $G$ with the higher colour on its middle vertex. We show that for $k\geq 3$, it is NP-complete to test whether a given planar bipartite graph of maximum degree $k$ and arbitrarily large girth admits a $k$-rs colouring, thereby answering a problem posed by Shalu and Sandhya (Graphs and Combinatorics, 2016). In addition, it is NP-complete to test whether a 3-star-colourable graph admits a 3-rs colouring. We also prove that for all $\epsilon > 0$, the optimization problem of restricted star colouring a 2-degenerate bipartite graph with the minimum number of colours is NP-hard to approximate within $n^{(1/3)-\epsilon}$. On the positive side, we design (i) a linear-time algorithm to test 3-rs colourability of trees, and (ii) an $O(n^3)$-time algorithm to test 3-rs colourability of chordal graphs.
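The two conditions defining a $k$-rs colouring can be checked directly: condition (ii) is equivalent to requiring that no vertex has two neighbours sharing a colour smaller than its own. The following verifier only illustrates the definition (the adjacency-dictionary interface is ours, not an algorithm from the paper):

```python
def is_rs_colouring(adj, f, k):
    """Check whether f: vertex -> {0, ..., k-1} is a k-restricted star
    colouring of the graph given by adjacency dict `adj`.

    (i)  f is a proper colouring.
    (ii) No bicoloured P_3 has the higher colour on its middle vertex,
         i.e. no vertex v has two neighbours with a common colour
         smaller than f(v).
    """
    if any(not (0 <= f[u] < k) for u in adj):
        return False
    for u in adj:
        for v in adj[u]:
            if f[u] == f[v]:                      # condition (i) fails
                return False
    for v in adj:
        lower = [f[u] for u in adj[v] if f[u] < f[v]]
        if len(lower) != len(set(lower)):         # condition (ii) fails
            return False
    return True
```

On the path $a\!-\!b\!-\!c$, the colouring $1,0,1$ is a valid 2-rs colouring, while $0,1,0$ is not: the middle vertex carries the higher colour of a bicoloured $P_3$.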
The statistical properties of spectra of quantum systems, studied within the framework of random matrix theory, are widely used in many areas of physics. These properties are affected if two or more sets of spectra, arising from the discrete symmetries present in the system, are superposed. The superposition of spectra of $m$ such circular orthogonal, unitary, and symplectic ensembles is studied numerically using higher-order spacing ratios. For given $m$ and Dyson index $\beta$, the modified index $\beta'$ is tabulated whose nearest-neighbour spacing distribution is identical to that of the $k$-th order spacing ratio. For the case of $m=2$ ($m=3$) in the COE (CUE), a scaling relation between $\beta'$ and $k$ is given. For the COE, it is conjectured that for the $k=m+1$ ($m\geq2$) and $k=m-3$ ($m\geq5$) order spacing ratio distributions, $\beta'$ is $m+2$ and $m-4$ respectively, whereas in the case of the CSE, for $k=m+1$ ($m\geq2$) and $k=m-1$ ($m\geq3$), $\beta'$ is $2m+3$ and $2(m-2)$ respectively. We also conjecture that for given $m$ ($k$) and $\beta$, the sequence of $\beta'$ as a function of $k$ ($m$) is unique. Strong numerical evidence in support of these results is presented, and the results are tested on complex systems such as measured nuclear resonances, the quantum chaotic kicked top, and spin chains.
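For orientation, the $k$-th order spacing ratio of a sorted spectrum $E_1 \le E_2 \le \dots$ is conventionally defined as $r_i^{(k)} = (E_{i+2k}-E_{i+k})/(E_{i+k}-E_i)$; a direct implementation of this standard definition (not code from the paper) is:

```python
import numpy as np

def spacing_ratios(levels, k=1):
    """k-th order spacing ratios of a spectrum:
    r_i = (E_{i+2k} - E_{i+k}) / (E_{i+k} - E_i) for the sorted levels.
    k = 1 gives the usual nearest-neighbour spacing ratios.
    """
    e = np.sort(np.asarray(levels, dtype=float))
    s = e[k:] - e[:-k]          # k-th order spacings
    return s[k:] / s[:-k]
```

The distributions of these ratios, for superposed ensemble spectra, are what the conjectured modified index $\beta'$ describes.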
This paper studies an important statistic for analyzing (finite) network data, called \emph{PageRank}, and a related new statistic, which we call \emph{MarkovRank}. PageRank was originally developed by the cofounders of \emph{Google}, Sergey Brin and Larry Page, to rank websites in their search engine results; it is computed by an iterative algorithm based on the idea that nodes with a larger number of incoming edges are more important. The aim of this paper is to analyze the common features of, and some significant differences between, PageRank and MarkovRank. A common merit of the two statistics is that both can be computed easily, either by direct mathematical computation or by an iterative algorithm. In the examples analyzed, the two statistics return somewhat different values, but the resulting rankings are close to each other. One notable difference is that only MarkovRank is free of tuning parameters: its ranking is determined solely by the adjacency matrix of the given network data. Moreover, we show that the ranking induced by MarkovRank coincides with that of the stationary distribution vector of the corresponding Markov chain (with finite state space) whenever the Markov chain is regular. Thus, MarkovRank may have the potential to play a role similar to the well-established PageRank from a practical point of view, sharing its light computational cost while adding some new theoretical validity.
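The iterative computation of PageRank mentioned above is the textbook power iteration on the Google matrix with a damping (teleportation) parameter; a minimal sketch follows (MarkovRank is specific to the paper and is not reproduced here):

```python
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-12):
    """Standard PageRank by power iteration.

    adjacency[i][j] = 1 if there is an edge i -> j.  Each row of the
    transition matrix is the out-degree-normalised adjacency row;
    dangling nodes (no outgoing edges) link uniformly to all nodes.
    The damping parameter is the tuning parameter that MarkovRank,
    by contrast, does not need.
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    out = A.sum(axis=1)
    P = np.where(out[:, None] > 0,
                 A / np.maximum(out, 1)[:, None],
                 1.0 / n)
    r = np.full(n, 1.0 / n)
    while True:
        r_new = damping * (r @ P) + (1 - damping) / n
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
```

On a directed cycle, for instance, symmetry forces the uniform vector to be the fixed point for any damping value.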
To see is to sketch -- free-hand sketching naturally builds ties between human and machine vision. In this paper, we present a novel approach for translating an object photo to a sketch, mimicking the human sketching process. This is an extremely challenging task because the photo and sketch domains differ significantly. Furthermore, human sketches exhibit various levels of sophistication and abstraction even when depicting the same object instance in a reference photo. This means that even if photo-sketch pairs are available, they only provide a weak supervision signal for learning a translation model. Compared with existing supervised approaches that solve the problem of D(E(photo)) -> sketch, where E($\cdot$) and D($\cdot$) denote encoder and decoder respectively, we take advantage of the inverse problem (e.g., D(E(sketch)) -> photo) and combine it with the unsupervised learning tasks of within-domain reconstruction, all within a multi-task learning framework. Compared with existing unsupervised approaches based on cycle consistency (i.e., D(E(D(E(photo)))) -> photo), we introduce a shortcut consistency enforced at the encoder bottleneck (e.g., D(E(photo)) -> photo) to exploit the additional self-supervision. Both qualitative and quantitative results show that the proposed model is superior to a number of state-of-the-art alternatives. We also show that the synthetic sketches can be used to train a better fine-grained sketch-based image retrieval (FG-SBIR) model, effectively alleviating the problem of sketch data scarcity.