We use results from communication complexity, both new and old ones, to prove lower bounds for unambiguous finite automata (UFAs). We show three results. $\textit{Complement:}$ There is a language $L$ recognised by an $n$-state UFA such that the complement language $\overline{L}$ requires NFAs with $n^{\tilde{\Omega}(\log n)}$ states. This improves on a lower bound by Raskin. $\textit{Union:}$ There are languages $L_1$, $L_2$ recognised by $n$-state UFAs such that the union $L_1\cup L_2$ requires UFAs with $n^{\tilde{\Omega}(\log n)}$ states. $\textit{Separation:}$ There is a language $L$ such that both $L$ and $\overline{L}$ are recognised by $n$-state NFAs but such that $L$ requires UFAs with $n^{\Omega(\log n)}$ states. This refutes a conjecture by Colcombet.
Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e. their outputs vary from smallest pieces of characters to the surface form of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that Morphological-level tokenizer has challenging performance with de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance.
We study the problem of testing whether a function $f: \mathbb{R}^n \to \mathbb{R}$ is a polynomial of degree at most $d$ in the \emph{distribution-free} testing model. Here, the distance between functions is measured with respect to an unknown distribution $\mathcal{D}$ over $\mathbb{R}^n$ from which we can draw samples. In contrast to previous work, we do not assume that $\mathcal{D}$ has finite support. We design a tester that given query access to $f$, and sample access to $\mathcal{D}$, makes $(d/\varepsilon)^{O(1)}$ many queries to $f$, accepts with probability $1$ if $f$ is a polynomial of degree $d$, and rejects with probability at least $2/3$ if every degree-$d$ polynomial $P$ disagrees with $f$ on a set of mass at least $\varepsilon$ with respect to $\mathcal{D}$. Our result also holds under mild assumptions when we receive only a polynomial number of bits of precision for each query to $f$, or when $f$ can only be queried on rational points representable using a logarithmic number of bits. Along the way, we prove a new stability theorem for multivariate polynomials that may be of independent interest.
We study the distributed minimum spanning tree (MST) problem, a fundamental problem in distributed computing. It is well-known that distributed MST can be solved in $\tilde{O}(D+\sqrt{n})$ rounds in the standard CONGEST model (where $n$ is the network size and $D$ is the network diameter) and this is essentially the best possible round complexity (up to logarithmic factors). However, in resource-constrained networks such as ad hoc wireless and sensor networks, nodes spending so much time can lead to significant spending of resources such as energy. Motivated by the above consideration, we study distributed algorithms for MST under the \emph{sleeping model} [Chatterjee et al., PODC 2020], a model for design and analysis of resource-efficient distributed algorithms. In the sleeping model, a node can be in one of two modes in any round -- \emph{sleeping} or \emph{awake} (unlike the traditional model where nodes are always awake). Only the rounds in which a node is \emph{awake} are counted, while \emph{sleeping} rounds are ignored. A node spends resources only in the awake rounds and hence the main goal is to minimize the \emph{awake complexity} of a distributed algorithm, the worst-case number of rounds any node is awake. We present deterministic and randomized distributed MST algorithms that have an \emph{optimal} awake complexity of $O(\log n)$ time with a matching lower bound. We also show that our randomized awake-optimal algorithm has essentially the best possible round complexity by presenting a lower bound of $\tilde{\Omega}(n)$ on the product of the awake and round complexity of any distributed algorithm (including randomized) that outputs an MST, where $\tilde{\Omega}$ hides a $1/(\text{polylog } n)$ factor.
Convolutional neural networks (CNNs) are important in a wide variety of machine learning tasks and applications, so optimizing their performance is essential. Moving words of data between levels of a memory hierarchy or between processors on a network is much more expensive than the cost of arithmetic, so minimizing communication is critical to optimizing performance. In this paper, we present new lower bounds on data movement for mixed precision convolutions in both single-processor and parallel distributed memory models, as well as algorithms that outperform current implementations such as Im2Col. We obtain performance figures using GEMMINI, a machine learning accelerator, where our tiling provides improvements between 13% and 150% over a vendor supplied algorithm.
This paper presents new deterministic and distributed low-diameter decomposition algorithms for weighted graphs. In particular, we show that if one can efficiently compute approximate distances in a parallel or a distributed setting, one can also efficiently compute low-diameter decompositions. This consequently implies solutions to many fundamental distance based problems using a polylogarithmic number of approximate distance computations. Our low-diameter decomposition generalizes and extends the line of work starting from [Rozho\v{n}, Ghaffari STOC 2020] to weighted graphs in a very model-independent manner. Moreover, our clustering results have additional useful properties, including strong-diameter guarantees, separation properties, restricting cluster centers to specified terminals, and more. Applications include: -- The first near-linear work and polylogarithmic depth randomized and deterministic parallel algorithm for low-stretch spanning trees (LSST) with polylogarithmic stretch. Previously, the best parallel LSST algorithm required $m \cdot n^{o(1)}$ work and $n^{o(1)}$ depth and was inherently randomized. No deterministic LSST algorithm with truly sub-quadratic work and sub-linear depth was known. -- The first near-linear work and polylogarithmic depth deterministic algorithm for computing an $\ell_1$-embedding into polylogarithmic dimensional space with polylogarithmic distortion. The best prior deterministic algorithms for $\ell_1$-embeddings either require large polynomial work or are inherently sequential. Even when we apply our techniques to the classical problem of computing a ball-carving with strong-diameter $O(\log^2 n)$ in an unweighted graph, our new clustering algorithm still leads to an improvement in round complexity from $O(\log^{10} n)$ rounds [Chang, Ghaffari PODC 21] to $O(\log^{4} n)$.
We introduce a restriction of the classical 2-party deterministic communication protocol where Alice and Bob are restricted to using only comparison functions. We show that the complexity of a function in the model is, up to a constant factor, determined by a complexity measure analogous to Yao's tiling number, which we call the geometric tiling number which can be computed in polynomial time. As a warm-up, we consider an analogous restricted decision tree model and observe a 1-dimensional analog of the above results.
The Korkine--Zolotareff (KZ) reduction, and its generalisations, are widely used lattice reduction strategies in communications and cryptography. The KZ constant and Schnorr's constant were defined by Schnorr in 1987. The KZ constant can be used to quantify some useful properties of KZ reduced matrices. Schnorr's constant can be used to characterize the output quality of his block $2k$-reduction and is used to define his semi block $2k$-reduction, which was also developed in 1987. Hermite's constant, which is a fundamental constant lattices, has many applications, such as bounding the length of the shortest nonzero lattice vector and the orthogonality defect of lattices. Rankin's constant was introduced by Rankin in 1953 as a generalization of Hermite's constant. It plays an important role in characterizing the output quality of block-Rankin reduction, proposed by Gama et al. in 2006. In this paper, we first develop a linear upper bound on Hermite's constant and then use it to develop an upper bound on the KZ constant. These upper bounds are sharper than those obtained recently by the authors, and the ratio of the new linear upper bound to the nonlinear upper bound, developed by Blichfeldt in 1929, on Hermite's constant is asymptotically 1.0047. Furthermore, we develop lower and upper bounds on Schnorr's constant. The improvement to the lower bound over the sharpest existing one developed by Gama et al. is around 1.7 times asymptotically, and the improvement to the upper bound over the sharpest existing one which was also developed by Gama et al. is around 4 times asymptotically. Finally, we develop lower and upper bounds on Rankin's constant. The improvements of the bounds over the sharpest existing ones, also developed by Gama et al., are exponential in the parameter defining the constant.
We study dynamic algorithms for the problem of maximizing a monotone submodular function over a stream of $n$ insertions and deletions. We show that any algorithm that maintains a $(0.5+\epsilon)$-approximate solution under a cardinality constraint, for any constant $\epsilon>0$, must have an amortized query complexity that is $\mathit{polynomial}$ in $n$. Moreover, a linear amortized query complexity is needed in order to maintain a $0.584$-approximate solution. This is in sharp contrast with recent dynamic algorithms of [LMNF+20, Mon20] that achieve $(0.5-\epsilon)$-approximation with a $\mathsf{poly}\log(n)$ amortized query complexity. On the positive side, when the stream is insertion-only, we present efficient algorithms for the problem under a cardinality constraint and under a matroid constraint with approximation guarantee $1-1/e-\epsilon$ and amortized query complexities $\smash{O(\log (k/\epsilon)/\epsilon^2)}$ and $\smash{k^{\tilde{O}(1/\epsilon^2)}\log n}$, respectively, where $k$ denotes the cardinality parameter or the rank of the matroid.
The stochastic gradient Langevin Dynamics is one of the most fundamental algorithms to solve sampling problems and non-convex optimization appearing in several machine learning applications. Especially, its variance reduced versions have nowadays gained particular attention. In this paper, we study two variants of this kind, namely, the Stochastic Variance Reduced Gradient Langevin Dynamics and the Stochastic Recursive Gradient Langevin Dynamics. We prove their convergence to the objective distribution in terms of KL-divergence under the sole assumptions of smoothness and Log-Sobolev inequality which are weaker conditions than those used in prior works for these algorithms. With the batch size and the inner loop length set to $\sqrt{n}$, the gradient complexity to achieve an $\epsilon$-precision is $\tilde{O}((n+dn^{1/2}\epsilon^{-1})\gamma^2 L^2\alpha^{-2})$, which is an improvement from any previous analyses. We also show some essential applications of our result to non-convex optimization.
We introduce FRAT, a new proof format for unsatisfiable SAT problems, and its associated toolchain. Compared to DRAT, the FRAT format allows solvers to include more information in proofs to reduce the computational cost of subsequent elaboration to LRAT. The format is easy to parse forward and backward, and it is extensible to future proof methods. The provision of optional proof steps allows SAT solver developers to balance implementation effort against elaboration time, with little to no overhead on solver time. We benchmark our FRAT toolchain against a comparable DRAT toolchain and confirm >84% median reduction in elaboration time and >94% median decrease in peak memory usage.