We study the following combinatorial problem. Given a set of $n$ y-monotone \emph{wires}, a \emph{tangle} determines the order of the wires on a number of horizontal \emph{layers} such that the orders of the wires on any two consecutive layers differ only in swaps of neighboring wires. Given a multiset~$L$ of \emph{swaps} (that is, unordered pairs of wires) and an initial order of the wires, a tangle \emph{realizes}~$L$ if each pair of wires changes its order exactly as many times as specified by~$L$. \textsc{List-Feasibility} is the problem of finding a tangle that realizes a given list~$L$ if such a tangle exists. \textsc{Tangle-Height Minimization} is the problem of finding a tangle that realizes a given list and additionally uses the minimum number of layers. \textsc{List-Feasibility} (and therefore \textsc{Tangle-Height Minimization}) is NP-hard [Yamanaka, Horiyama, Uno, Wasa; CCCG 2018]. We prove that \textsc{List-Feasibility} remains NP-hard if every pair of wires swaps only a constant number of times. On the positive side, we present an algorithm for \textsc{Tangle-Height Minimization} that computes an optimal tangle for $n$ wires and a given list~$L$ of swaps in $O((2|L|/n^2+1)^{n^2/2} \cdot \varphi^n \cdot n)$ time, where $\varphi \approx 1.618$ is the golden ratio and $|L|$ is the total number of swaps in~$L$. From this algorithm, we derive a simpler and faster version to solve \textsc{List-Feasibility}. We also use the algorithm to show that \textsc{List-Feasibility} is in NP and fixed-parameter tractable with respect to the number of wires. For \emph{simple} lists, where every swap occurs at most once, we show how to solve \textsc{Tangle-Height Minimization} in $O(n!\varphi^n)$ time.
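To make the setup concrete, here is a minimal brute-force sketch (not the paper's $O((2|L|/n^2+1)^{n^2/2} \cdot \varphi^n \cdot n)$-time algorithm) that searches layer by layer over wire orders and remaining swaps to solve tiny instances of \textsc{Tangle-Height Minimization}; the function name and input encoding are illustrative choices of ours.

```python
from collections import deque
from itertools import combinations

def min_layers(initial, swap_list):
    """Brute-force Tangle-Height Minimization for tiny instances (toy sketch).

    initial:   tuple giving the wire order on the first layer.
    swap_list: list of unordered pairs (a, b); multiplicities define the list L.
    Returns the minimum number of layers of a tangle realizing L, or None
    if no such tangle exists (List-Feasibility fails from this initial order).
    """
    def key(a, b):
        return (a, b) if a < b else (b, a)

    remaining0 = {}
    for a, b in swap_list:
        remaining0[key(a, b)] = remaining0.get(key(a, b), 0) + 1

    freeze = lambda rem: tuple(sorted(rem.items()))
    start = (tuple(initial), freeze(remaining0))
    queue, seen = deque([(start, 1)]), {start}   # a swap-free tangle uses one layer
    while queue:
        (perm, rem_items), layers = queue.popleft()
        rem = dict(rem_items)
        if not rem:                              # all required swaps realized
            return layers
        # positions whose adjacent pair still needs a swap
        cand = [i for i in range(len(perm) - 1)
                if rem.get(key(perm[i], perm[i + 1]), 0) > 0]
        for k in range(1, len(cand) + 1):
            for subset in combinations(cand, k):
                # swaps performed between two consecutive layers must be disjoint
                if any(subset[j + 1] - subset[j] == 1 for j in range(len(subset) - 1)):
                    continue
                new_perm, new_rem = list(perm), dict(rem)
                for i in subset:
                    a, b = perm[i], perm[i + 1]
                    new_perm[i], new_perm[i + 1] = b, a
                    new_rem[key(a, b)] -= 1
                    if new_rem[key(a, b)] == 0:
                        del new_rem[key(a, b)]
                state = (tuple(new_perm), freeze(new_rem))
                if state not in seen:
                    seen.add(state)
                    queue.append((state, layers + 1))
    return None
```

For instance, `min_layers((1, 2, 3), [(1, 2), (1, 3), (2, 3)])` returns 4: with only three wires at most one swap fits between consecutive layers, so three swaps need three layer transitions.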
We conduct a systematic study of the approximation properties of the Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of the Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects by establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads, and these insights also provide natural suggestions for alternative architectures.
In the last two decades, the linear model of coregionalization (LMC) has been widely used to model multivariate spatial processes. From a computational standpoint, the LMC is a substantially easier model to work with than other multidimensional alternatives. Up to now, this fact has been largely overlooked in the literature. Starting from an analogy with matrix normal models, we propose a reformulation of the LMC likelihood that highlights the linear, rather than cubic, computational complexity as a function of the dimension of the response vector. Further, we describe in detail how these simplifications can be included in a Gaussian hierarchical model. In addition, we demonstrate in two examples how the disentangled version of the likelihood we derive can be exploited to improve Markov chain Monte Carlo (MCMC)-based computations when conducting Bayesian inference. The first is an interwoven approach that combines samples from centered and whitened parametrizations of the latent LMC-distributed random fields. The second is a sparsity-inducing method that introduces structural zeros in the coregionalization matrix in an attempt to reduce the number of parameters in a principled way; it also provides a new way to investigate the strength of the correlation among the components of the outcome vector. Both approaches come at virtually no additional cost and are shown to significantly improve MCMC performance and predictive performance, respectively. We apply our methodology to a dataset comprising air pollutant measurements in the state of California.
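As a rough illustration of where the linear-in-$p$ cost comes from, the sketch below evaluates the likelihood of a full-rank LMC by undoing the coregionalization matrix and scoring the $p$ latent columns independently. It assumes a square invertible coregionalization matrix and only mirrors the matrix-normal analogy; it is not the paper's hierarchical model or MCMC machinery.

```python
import numpy as np
from scipy.stats import multivariate_normal

def lmc_loglik(Y, A, R_list):
    """Log-likelihood of a full-rank LMC via a matrix-normal-style factorization.

    Toy illustration of the linear-in-p cost, not the paper's hierarchical model.

    Y:      (n, p) observations, one row per spatial location.
    A:      (p, p) coregionalization matrix, assumed invertible here.
    R_list: list of p spatial correlation matrices, each (n, n).

    Model: Y = W A^T with independent columns w_j ~ N(0, R_j), i.e.
    vec(Y) ~ N(0, sum_j a_j a_j^T kron R_j).
    """
    n, p = Y.shape
    W = np.linalg.solve(A, Y.T).T                     # latent factors W = Y A^{-T}
    # p factorizations of n x n matrices instead of one (np) x (np) factorization
    ll = sum(multivariate_normal(mean=np.zeros(n), cov=R_list[j]).logpdf(W[:, j])
             for j in range(p))
    _, logabsdet = np.linalg.slogdet(A)               # Jacobian of vec(Y) = (A kron I_n) vec(W)
    return ll - n * logabsdet
```

The cost is dominated by $p$ factorizations of $n \times n$ correlation matrices rather than one factorization of the full $np \times np$ covariance, which is the simplification the reformulated likelihood exposes.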
For positive integers $d$ and $p$ such that $d \ge p$, we obtain complete asymptotic expansions, for large $d$, of the normalizing constants for the matrix Bingham and matrix Langevin distributions on Stiefel manifolds. The accuracy of each truncated expansion is strictly increasing in $d$; also, for sufficiently large $d$, the accuracy is strictly increasing in $m$, the number of terms in the truncated expansion. We apply these results to obtain the rate of convergence of these asymptotic expansions as both $d, p \to \infty$. Using values of $d$ and $p$ arising in various data sets, we illustrate the rate of convergence of the truncated approximations as $d$ or $m$ increases. These results extend our recent work on asymptotic expansions for the normalizing constants of the high-dimensional Bingham distributions.
Let $(X, d)$ be a metric space and let $C \subseteq 2^X$ be a collection of special objects. In the $(X,d,C)$-chasing problem, an online player receives a sequence of online requests $\{B_t\}_{t=1}^T \subseteq C$ and responds with a trajectory $\{x_t\}_{t=1}^T$ such that $x_t \in B_t$. This response incurs a movement cost $\sum_{t=1}^T d(x_t, x_{t-1})$, and the online player strives to minimize the competitive ratio -- the worst-case ratio, over all input sequences, between the online movement cost and the optimal movement cost in hindsight. Under this setup, we call the $(X,d,C)$-chasing problem \textit{chaseable} if there exists an online algorithm with finite competitive ratio. In the case of Convex Body Chasing (CBC) over real normed vector spaces, Bubeck et al. (2019) proved the chaseability of the problem. Furthermore, in the vector space setting, the dimension of the ambient space appears to be the factor controlling the size of the competitive ratio. Indeed, Sellke (2020) recently provided a $d$-competitive online algorithm over arbitrary real normed vector spaces $(\mathbb{R}^d, \|\cdot\|)$, and we present a general strategy for obtaining novel lower bounds of the form $\Omega(d^c)$, $c > 0$, for CBC in the same setting. In this paper, we also prove that the \textit{doubling} and \textit{Assouad} dimensions of a metric space exert no control over the hardness of ball chasing over the said metric space. More specifically, we show that for any large enough $\rho \in \mathbb{R}$, there exists a metric space $(X,d)$ of doubling dimension $\Theta(\rho)$ and Assouad dimension $\rho$ such that no online selector can achieve a finite competitive ratio in the general ball chasing regime.
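To fix the cost model, here is a toy greedy chaser for Euclidean ball chasing that always moves to the nearest point of the requested ball and tallies the movement cost. Greedy is only a baseline (it is well known not to be competitive in general) and is unrelated to the lower-bound constructions above; the function name and interface are our own.

```python
import numpy as np

def greedy_ball_chase(start, requests):
    """Greedy chaser for Euclidean ball chasing: move to the nearest point of
    each requested ball and accumulate the movement cost.

    A baseline to make the cost model concrete; greedy is not competitive in
    general and is unrelated to the paper's constructions.

    start:    initial point x_0, shape (d,).
    requests: iterable of (center, radius) pairs describing the balls B_t.
    Returns the trajectory [x_0, x_1, ..., x_T] and the cost sum_t ||x_t - x_{t-1}||.
    """
    x = np.asarray(start, dtype=float)
    trajectory, cost = [x.copy()], 0.0
    for center, radius in requests:
        center = np.asarray(center, dtype=float)
        gap = np.linalg.norm(x - center)
        if gap > radius:                  # outside B_t: project onto its boundary
            cost += gap - radius
            x = center + (x - center) * (radius / gap)
        trajectory.append(x.copy())
    return trajectory, cost
```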
Given a set $P$ of $n$ points in the plane, in general position, denote by $N_\Delta(P)$ the number of empty triangles with vertices in $P$. In this paper we investigate by how much $N_\Delta(P)$ changes if a point $x$ is removed from $P$. By constructing a graph $G_P(x)$ based on the arrangement of the empty triangles incident to $x$, we transform this geometric problem into the problem of counting triangles in the graph $G_P(x)$. We study properties of the graph $G_P(x)$ and, in particular, show that it is kite-free. This relates the growth rate of the number of empty triangles to the famous Ruzsa-Szemer\'edi problem.
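The quantity being tracked can be checked directly on small instances: the brute-force sketch below counts empty triangles in $O(n^4)$ time, so the effect of deleting a point $x$ is simply $N_\Delta(P) - N_\Delta(P \setminus \{x\})$. It does not build the graph $G_P(x)$ from the paper.

```python
from itertools import combinations

def count_empty_triangles(points):
    """Brute-force count of empty triangles in a planar point set (O(n^4) toy check).

    points: list of (x, y) tuples in general position.
    """
    def cross(o, a, b):
        # signed area test: positive if o, a, b make a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def strictly_inside(p, a, b, c):
        s1, s2, s3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
        return (s1 > 0 and s2 > 0 and s3 > 0) or (s1 < 0 and s2 < 0 and s3 < 0)

    count = 0
    for a, b, c in combinations(points, 3):
        if not any(strictly_inside(p, a, b, c) for p in points if p not in (a, b, c)):
            count += 1
    return count
```

For example, for five points in convex position all $\binom{5}{3} = 10$ triangles are empty, and removing any one point leaves $4$ empty triangles, so the count drops by $6$.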
In the maximum independent set of convex polygons problem, we are given a set of $n$ convex polygons in the plane with the objective of selecting a maximum-cardinality subset of non-overlapping polygons. Here we study a special case of the problem where the edges of the polygons can take at most $d$ fixed directions. We present an $8d/3$-approximation algorithm for this problem running in time $O((nd)^{O(d4^d)})$. The previous-best polynomial-time approximation (for constant $d$) was a classical $n^\varepsilon$-approximation by Fox and Pach [SODA'11], which has recently been improved to an $OPT^{\varepsilon}$-approximation algorithm by Cslovjecsek, Pilipczuk and W\k{e}grzycki [SODA'24] that also extends to an arbitrary set of convex polygons. Our result builds on, and generalizes, the recent constant-factor approximation algorithms for the maximum independent set of axis-parallel rectangles problem (which is a special case of our problem with $d=2$) by Mitchell [FOCS'21] and G\'{a}lvez, Khan, Mari, M\"{o}mke, Reddy, and Wiese [SODA'22].
Order-invariant first-order logic is an extension of first-order logic (FO) where formulae can make use of a linear order on the structures, under the proviso that they are order-invariant, i.e., that their truth value is the same for all linear orders. We continue the study of the two-variable fragment of order-invariant first-order logic initiated by Zeume and Harwath, and study its complexity and expressive power. We first establish coNExpTime-completeness of the problem of deciding whether a given two-variable formula is order-invariant, which tightens and significantly simplifies the coN2ExpTime proof by Zeume and Harwath. Second, we address the question of whether every property expressible in order-invariant two-variable logic is also expressible in first-order logic without the use of a linear order. While we were not able to provide a satisfactory answer to this question, we suspect that the answer is ``no''. To justify our claim, we present a class of finite tree-like structures (of unbounded degree) in which a relaxed variant of order-invariant two-variable FO expresses properties that are not definable in plain FO. On the other hand, we show that if one restricts attention to classes of structures of bounded degree, then the expressive power of order-invariant two-variable FO is contained within plain FO.
We study a sequential binary prediction setting where the forecaster is evaluated in terms of the calibration distance, which is defined as the $L_1$ distance between the predicted values and the set of predictions that are perfectly calibrated in hindsight. This is analogous to a calibration measure recently proposed by B{\l}asiok, Gopalan, Hu and Nakkiran (STOC 2023) for the offline setting. The calibration distance is a natural and intuitive measure of deviation from perfect calibration, and satisfies a Lipschitz continuity property which does not hold for many popular calibration measures, such as the $L_1$ calibration error and its variants. We prove that there is a forecasting algorithm that achieves an $O(\sqrt{T})$ calibration distance in expectation on an adversarially chosen sequence of $T$ binary outcomes. At the core of this upper bound is a structural result showing that the calibration distance is accurately approximated by the lower calibration distance, which is a continuous relaxation of the former. We then show that an $O(\sqrt{T})$ lower calibration distance can be achieved via a simple minimax argument and a reduction to online learning on a Lipschitz class. On the lower bound side, an $\Omega(T^{1/3})$ calibration distance is shown to be unavoidable, even when the adversary outputs a sequence of independent random bits and has the additional ability to stop early (i.e., to stop producing random bits and output the same bit in the remaining steps). Interestingly, without this early stopping, the forecaster can achieve a much smaller calibration distance of $\mathrm{polylog}(T)$.
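As a sanity check of the definition (not the forecasting algorithm), the calibration distance can be computed by brute force on tiny sequences: any perfectly calibrated sequence partitions the time steps into groups whose common predicted value equals that group's empirical outcome frequency, so one can minimize the $L_1$ gap over all such groupings. The sketch below does exactly this; the unnormalized $L_1$ sum and the function name are our illustrative choices.

```python
from itertools import product
import numpy as np

def calibration_distance(preds, outcomes):
    """Brute-force calibration distance for tiny sequences (T^T enumeration).

    Illustrates the definition only, not the paper's forecasting algorithm.

    preds:    list of predicted probabilities p_t in [0, 1].
    outcomes: list of binary outcomes y_t in {0, 1}.

    A sequence q is perfectly calibrated in hindsight if, for every value v,
    the empirical mean of y_t over {t : q_t = v} equals v; the calibration
    distance is min_q sum_t |p_t - q_t| over all such q (up to normalization).
    """
    T = len(preds)
    preds = np.asarray(preds, dtype=float)
    best = float("inf")
    # Assign each step to one of at most T groups; within a group the perfectly
    # calibrated prediction is forced to equal the group's empirical frequency.
    for labels in product(range(T), repeat=T):
        q = np.empty(T)
        for g in set(labels):
            idx = [t for t in range(T) if labels[t] == g]
            q[idx] = np.mean([outcomes[t] for t in idx])
        best = min(best, float(np.sum(np.abs(preds - q))))
    return best
```

For example, `calibration_distance([0.5, 0.5], [1, 0])` returns 0, since predicting 1/2 on both steps is perfectly calibrated for one head and one tail; the $T^T$ enumeration is only usable for very small $T$.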
We prove an exponential separation between depth-2 and depth-3 neural networks when approximating an $\mathcal{O}(1)$-Lipschitz target function to constant accuracy, with respect to a distribution with support in $[0,1]^{d}$, assuming exponentially bounded weights. This addresses an open problem posed in \citet{safran2019depth}, and proves that the curse of dimensionality manifests in depth-2 approximation even in cases where the target function can be represented efficiently using depth 3. Previously, the lower bounds used to separate depth 2 from depth 3 required that at least one of the Lipschitz parameter, the target accuracy, or (some measure of) the size of the domain of approximation scales polynomially with the input dimension, whereas we fix the former two and restrict our domain to the unit hypercube. Our lower bound holds for a wide variety of activation functions, and is based on a novel application of an average- to worst-case random self-reducibility argument, which reduces the problem to lower bounds for threshold circuits.
We study the problem of symmetric matrix completion, where the goal is to reconstruct a positive semidefinite matrix $\mathrm{X}^\star \in \mathbb{R}^{d\times d}$ of rank $r$, parameterized by $\mathrm{U}\mathrm{U}^{\top}$, from observations of only a subset of its entries. We show that vanilla gradient descent (GD) with small initialization provably converges to the ground truth $\mathrm{X}^\star$ without requiring any explicit regularization. This convergence result holds even in the over-parameterized scenario, where the true rank $r$ is unknown and conservatively over-estimated by a search rank $r'\gg r$. Existing results for this problem either require explicit regularization, a sufficiently accurate initial point, or exact knowledge of the true rank $r$. In the over-parameterized regime where $r'\geq r$, we show that, with $\widetilde\Omega(dr^9)$ observations, GD with an initial point satisfying $\|\mathrm{U}_0\| \leq \epsilon$ converges near-linearly to an $\epsilon$-neighborhood of $\mathrm{X}^\star$. Consequently, smaller initial points result in increasingly accurate solutions. Surprisingly, neither the convergence rate nor the final accuracy depends on the over-parameterized search rank $r'$; both are governed only by the true rank $r$. In the exactly-parameterized regime where $r'=r$, we further strengthen this result by proving that GD converges at a faster rate to achieve an arbitrarily small accuracy $\epsilon>0$, provided the initial point satisfies $\|\mathrm{U}_0\| = O(1/d)$. At the core of our method lies a novel weakly-coupled leave-one-out analysis, which allows us to establish the global convergence of GD, extending beyond what was previously possible using the classical leave-one-out analysis.
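A minimal numerical sketch of the setting (vanilla, unregularized gradient descent on the observed entries, started from a small random factor with an over-estimated rank) is given below. It only illustrates the optimization problem; the step size, iteration count, and sampling model are placeholder choices, not those of the analysis.

```python
import numpy as np

def symmetric_completion_gd(M_obs, mask, r_search, eps=1e-3, lr=0.01, iters=20000, seed=0):
    """Vanilla GD on the observed entries of a symmetric matrix, from a small start.

    Toy sketch of the setting only; hyperparameters are placeholders and not
    those analyzed in the paper.

    M_obs:    (d, d) array; entries outside `mask` may hold arbitrary finite values.
    mask:     (d, d) boolean array of observed positions (assumed symmetric).
    r_search: search rank r' (possibly over-estimating the true rank r).
    eps:      scale of the small random initialization ||U_0||.
    """
    d = M_obs.shape[0]
    rng = np.random.default_rng(seed)
    U = eps * rng.standard_normal((d, r_search))      # small initialization
    for _ in range(iters):
        R = np.where(mask, U @ U.T - M_obs, 0.0)      # residual on observed entries
        U -= lr * (R + R.T) @ U                       # gradient of 0.5*||P_Omega(UU^T - M)||_F^2
    return U @ U.T
```

On a synthetic instance one would form $\mathrm{X}^\star = \mathrm{U}^\star \mathrm{U}^{\star\top}$ for a random $d \times r$ factor, reveal a random symmetric subset of entries, and compare the recovery error for several initialization scales $\epsilon$ and search ranks $r'$.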