We present a randomized algorithm that computes single-source shortest paths (SSSP) in $O(m\log^8(n)\log W)$ time when edge weights are integral and can be negative. This essentially resolves the classic negative-weight SSSP problem. The previous bounds are $\tilde O((m+n^{1.5})\log W)$ [BLNPSSSW FOCS'20] and $m^{4/3+o(1)}\log W$ [AMV FOCS'20]. Near-linear time algorithms were known previously only for the special case of planar directed graphs [Fakcharoenphol and Rao FOCS'01]. In contrast to all recent developments that rely on sophisticated continuous optimization methods and dynamic algorithms, our algorithm is simple: it requires only a simple graph decomposition and elementary combinatorial tools. In fact, ours is the first combinatorial algorithm for negative-weight SSSP to break through the classic $\tilde O(m\sqrt{n}\log W)$ bound from over three decades ago [Gabow and Tarjan SICOMP'89].
On an orientable surface $S$, consider a collection $\Gamma$ of closed curves. The (geometric) intersection number $i_S(\Gamma)$ is the minimum number of self-intersections that a collection $\Gamma'$ can have, where $\Gamma'$ results from a continuous deformation (homotopy) of $\Gamma$. We provide algorithms that compute $i_S(\Gamma)$ and such a $\Gamma'$, assuming that $\Gamma$ is given by a collection of closed walks of length $n$ in a graph $M$ cellularly embedded on $S$, in $O(n \log n)$ time when $M$ and $S$ are fixed. The state of the art is a paper of Despr\'e and Lazarus [SoCG 2017, J. ACM 2019], who compute $i_S(\Gamma)$ in $O(n^2)$ time, and $\Gamma'$ in $O(n^4)$ time if $\Gamma$ is a single closed curve. Our result is more general since we can put an arbitrary number of closed curves in minimal position. Also, our algorithms are quasi-linear in $n$ instead of quadratic and quartic, and our proofs are simpler and shorter. We use techniques from two-dimensional topology and from the theory of hyperbolic surfaces. Most notably, we prove a new property of the reducing triangulations introduced by Colin de Verdi\`ere, Despr\'e, and Dubois [SODA 2024], reducing our problem to the case of surfaces with boundary. As a key subroutine, we rely on an algorithm of Fulek and T\'oth [JCO 2020].
Most mathematical distortions used in ML are fundamentally integral in nature: $f$-divergences, Bregman divergences, (regularized) optimal transport distances, integral probability metrics, geodesic distances, etc. In this paper, we unveil a grounded theory and tools which can help improve these distortions to better cope with ML requirements. We start with a generalization of Riemann integration that also encapsulates functions that are not strictly additive but are, more generally, $t$-additive, as in nonextensive statistical mechanics. Notably, this recovers Volterra's product integral as a special case. We then generalize the Fundamental Theorem of calculus using an extension of the (Euclidean) derivative. This, along with a series of more specific Theorems, serves as a basis for results showing how one can specifically design, alter, or change fundamental properties of distortion measures in a simple way, with a special emphasis on geometric- and ML-related properties that are the metricity, hyperbolicity, and encoding. We show how to apply it to a problem that has recently gained traction in ML: hyperbolic embeddings with a "cheap" and accurate encoding along the hyperbolic vs Euclidean scale. We unveil a new application for which the Poincar\'e disk model has very appealing features, and our theory comes in handy: \textit{model} embeddings for boosted combinations of decision trees, trained using the log-loss (trees) and logistic loss (combinations).
Recently, Ainsworth et al. showed that using weight matching (WM) to minimize the $L_2$ distance in a permutation search of model parameters effectively identifies permutations that satisfy linear mode connectivity (LMC), in which the loss along a linear path between two independently trained models with different seeds remains nearly constant. This paper provides a theoretical analysis of LMC using WM, which is crucial for understanding stochastic gradient descent's effectiveness and its application in areas like model merging. We first experimentally and theoretically show that permutations found by WM do not significantly reduce the $L_2$ distance between two models and the occurrence of LMC is not merely due to distance reduction by WM in itself. We then provide theoretical insights showing that permutations can change the directions of the singular vectors, but not the singular values, of the weight matrices in each layer. This finding shows that permutations found by WM mainly align the directions of singular vectors associated with large singular values across models. This alignment brings the singular vectors with large singular values, which determine the model functionality, closer between pre-merged and post-merged models, so that the post-merged model retains functionality similar to the pre-merged models, making it easy to satisfy LMC. Finally, we analyze the difference between WM and straight-through estimator (STE), a dataset-dependent permutation search method, and show that WM outperforms STE, especially when merging three or more models.
The \emph{Fast Gaussian Transform} (FGT) enables subquadratic-time multiplication of an $n\times n$ Gaussian kernel matrix $\mathsf{K}_{i,j}= \exp ( - \| x_i - x_j \|_2^2 ) $ with an arbitrary vector $h \in \mathbb{R}^n$, where $x_1,\dots, x_n \in \mathbb{R}^d$ are a set of \emph{fixed} source points. This kernel plays a central role in machine learning and random feature maps. Nevertheless, in most modern data analysis applications, datasets are dynamically changing (yet often have low rank), and recomputing the FGT from scratch in (kernel-based) algorithms incurs a major computational overhead ($\gtrsim n$ time for a single source update $\in \mathbb{R}^d$). These applications motivate a \emph{dynamic FGT} algorithm, which maintains a dynamic set of sources under \emph{kernel-density estimation} (KDE) queries in \emph{sublinear time} while retaining Mat-Vec multiplication accuracy and speed. Assuming the dynamic data-points $x_i$ lie in a (possibly changing) $k$-dimensional subspace ($k\leq d$), our main result is an efficient dynamic FGT algorithm, supporting the following operations in $\log^{O(k)}(n/\varepsilon)$ time: (1) Adding or deleting a source point, and (2) Estimating the ``kernel-density'' of a query point with respect to sources with $\varepsilon$ additive accuracy. The core of the algorithm is a dynamic data structure for maintaining the \emph{projected} ``interaction rank'' between source and target boxes, decoupled into finite truncation of Taylor and Hermite expansions.
Attention computation takes both the time complexity of $O(n^2)$ and the space complexity of $O(n^2)$ simultaneously, which makes deploying Large Language Models (LLMs) in streaming applications that involve long contexts requiring substantial computational resources. In recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that is able to support a 128K-long document, in our paper, we focus on the memory-efficient issue when context length $n$ is much greater than 128K ($n \gg 2^d$). Considering a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$, the polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$. It accomplishes this by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$ to expedite attention ${\sf Attn}(Q, K, V)$ computation within $n^{1+o(1)}$ time executions. Despite this, computing the approximated attention matrix $U_1U_2^\top \in \mathbb{R}^{n \times n}$ still necessitates $O(n^2)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that only reads one pass of the data in a streaming fashion. This method employs sublinear space $o(n)$ to store three sketch matrices, alleviating the need for exact $K, V$ storage. Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens. As the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique attribute underscores the potential of our technique in efficiently handling LLMs in streaming applications.
The class XNLP consists of (parameterized) problems that can be solved nondeterministically in $f(k)n^{O(1)}$ time and $f(k)\log n$ space, where $n$ is the size of the input instance and $k$ the parameter. The class XALP consists of problems that can be solved in the above time and space with access to an additional stack. These two classes are a "natural home" for many standard graph problems and their generalizations. In this paper, we show the hardness of several problems on planar graphs, parameterized by outerplanarity, treewidth and pathwidth, thus strengthening several existing results. In particular, we show the XNLP-hardness of the following problems parameterized by outerplanarity: All-or-Nothing Flow, Target Outdegree Orientation, Capacitated (Red-Blue) Dominating Set, Target Set Selections etc. We also show the XNLP-completeness of Scattered Set parameterized by pathwidth and XALP-completeness parameterized by treewidth and outerplanarity.
Recently, a mask-based beamformer with attention-based spatial covariance matrix aggregator (ASA) was proposed, which was demonstrated to track moving sources accurately. However, the deep neural network model used in this algorithm is limited to a specific channel configuration, requiring a different model in case a different channel permutation, channel count, or microphone array geometry is considered. Addressing this limitation, in this paper, we investigate three approaches to improve the robustness of the ASA-based tracking method against such variations: incorporating random channel configurations during the training process, employing the transform-average-concatenate (TAC) method to process multi-channel input features (allowing for any channel count and enabling permutation invariance), and utilizing input features that are robust against variations of the channel configuration. Our experiments, conducted using the CHiME-3 and DEMAND datasets, demonstrate improved robustness against mismatches in channel permutations, channel counts, and microphone array geometries compared to the conventional ASA-based tracking method without compromising performance in matched conditions, suggesting that the mask-based beamformer with ASA integrating the proposed approaches has the potential to track moving sources for arbitrary microphone arrays.
We construct a randomized vector quantizer which has a smaller maximum error compared to all known lattice quantizers with the same entropy for dimensions 5, 6, ..., 48, and also has a smaller mean squared error compared to known lattice quantizers with the same entropy for dimensions 35, ..., 48, in the high resolution limit. Moreover, our randomized quantizer has a desirable property that the quantization error is always uniform over the ball and independent of the input. Our construction is based on applying rejection sampling on universal quantization, which allows us to shape the error distribution to be any continuous distribution, not only uniform distributions over basic cells of a lattice as in conventional dithered quantization. We also characterize the high SNR limit of one-shot channel simulation for any additive noise channel under a mild assumption (e.g., the AWGN channel), up to an additive constant of 1.45 bits.
A popular and flexible time series model for counts is the generalized integer autoregressive process of order $p$, GINAR($p$). These Markov processes are defined using thinning operators evaluated on past values of the process along with a discretely-valued innovation process. This class includes the commonly used INAR($p$) process, defined with binomial thinning and Poisson innovations. GINAR processes can be used in a variety of settings, including modeling time series with low counts, and allow for more general mean-variance relationships, capturing both over- or under-dispersion. While there are many thinning operators and innovation processes given in the literature, less focus has been spent on comparing statistical inference and forecasting procedures over different choices of GINAR process. We provide an extensive study of exact and approximate inference and forecasting methods that can be applied to a wide class of GINAR($p$) processes with general thinning and innovation parameters. We discuss the challenges of exact estimation when $p$ is larger. We summarize and extend asymptotic results for estimators of process parameters, and present simulations to compare small sample performance, highlighting how different methods compare. We illustrate this methodology by fitting GINAR processes to a disease surveillance series.
We consider a variant of matrix completion where entries are revealed in a biased manner, adopting a model akin to that introduced by Ma and Chen. Instead of treating this observation bias as a disadvantage, as is typically the case, the goal is to exploit the shared information between the bias and the outcome of interest to improve predictions. Towards this, we consider a natural model where the observation pattern and outcome of interest are driven by the same set of underlying latent or unobserved factors. This leads to a two stage matrix completion algorithm: first, recover (distances between) the latent factors by utilizing matrix completion for the fully observed noisy binary matrix corresponding to the observation pattern; second, utilize the recovered latent factors as features and sparsely observed noisy outcomes as labels to perform non-parametric supervised learning. The finite-sample error rates analysis suggests that, ignoring logarithmic factors, this approach is competitive with the corresponding supervised learning parametric rates. This implies the two-stage method has performance that is comparable to having access to the unobserved latent factors through exploiting the shared information between the bias and outcomes. Through empirical evaluation using a real-world dataset, we find that with this two-stage algorithm, the estimates have 30x smaller mean squared error compared to traditional matrix completion methods, suggesting the utility of the model and the method proposed in this work.