We give a $\widetilde{O}(n)$-time sampler for the independent sets of a matroid with $n$ elements. As an application, we obtain a near-linear time sampler for all-terminal network reliability.
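As a point of reference for the statement above, here is a minimal sketch (ours, not the paper's algorithm) of the textbook Markov chain that near-linear samplers of this kind accelerate: lazy add/drop dynamics on the independent sets of a graphic matroid, i.e., the forests of a graph. All function names are illustrative.

```python
import random

# Illustrative sketch: lazy add/drop (Glauber-style) dynamics on the
# independent sets of a graphic matroid, i.e., the forests of a graph.
# Its stationary distribution is uniform over forests; this is the slow
# textbook chain, NOT the near-linear sampler of the paper.

def creates_cycle(forest, u, v):
    """Return True iff adding edge (u, v) closes a cycle, i.e., u and v
    are already connected in the current forest."""
    adj = {}
    for a, b in forest:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in adj.get(x, []):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

def glauber_forest_sampler(edges, steps, rng=None):
    """Pick a random edge each step; toggle it with probability 1/2 when
    the move keeps the edge set independent (acyclic)."""
    rng = rng or random.Random(0)
    forest = set()
    for _ in range(steps):
        e = rng.choice(edges)
        if e in forest:
            if rng.random() < 0.5:
                forest.discard(e)
        elif not creates_cycle(forest, *e) and rng.random() < 0.5:
            forest.add(e)
    return forest

# Example: sample a near-uniform random forest of a 4-cycle.
print(glauber_forest_sampler([(0, 1), (1, 2), (2, 3), (3, 0)], steps=2000))
```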
Consider $k\le n$ agents in a simple, connected, and undirected graph $G=(V,E)$ with $n$ nodes and $m$ edges. The goal of the dispersion problem is to move these $k$ agents to distinct nodes. Agents can communicate only when they are at the same node, and no other means of communication, such as whiteboards, are available. We assume that the agents operate synchronously. We consider two scenarios: one where all agents are initially located at an arbitrary single node (the rooted setting) and one where they are initially distributed arbitrarily over one or more nodes (the general setting). Kshemkalyani and Sharma presented a dispersion algorithm for the general setting that uses $O(m_k)$ time and $O(\log(k+\delta))$ bits of memory per agent [OPODIS 2021]. Here, $m_k$ is the maximum number of edges in any induced subgraph of $G$ with $k$ nodes, and $\delta$ is the maximum degree of $G$. This algorithm is the fastest in the literature, as no algorithm running in $o(m_k)$ time is known even for the rooted setting. In this paper, we present faster algorithms for both the rooted and general settings. First, we present an algorithm for the rooted setting that solves the dispersion problem in $O(k\log \min(k,\delta))=O(k\log k)$ time using $O(\log \delta)$ bits of memory per agent. Next, we propose an algorithm for the general setting that achieves dispersion in $O(k \log k \cdot \log \min(k,\delta))=O(k \log^2 k)$ time using $O(\log (k+\delta))$ bits of memory per agent.
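To make the problem statement concrete, here is a hedged toy simulation (ours; it ignores the synchronous rounds and the per-agent memory bounds, and it is not the paper's algorithm) of the rooted setting, where the group of co-located agents walks a DFS and drops one agent at each newly visited node:

```python
from collections import defaultdict

# Toy illustration of rooted dispersion: k agents start at one node, traverse
# the graph in DFS order, and one agent settles at each newly visited node.
# This only sketches the problem; it does not model the round-by-round
# synchrony or the O(log delta)-bit memory constraint of the paper.

def rooted_dfs_dispersion(adj, root, k):
    settled = {}              # node -> id of the agent that settles there
    stack, seen = [root], {root}
    while stack and len(settled) < k:
        v = stack.pop()
        settled[v] = len(settled)
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return settled

adj = defaultdict(list)
for a, b in [(0, 1), (0, 2), (1, 3), (2, 3)]:
    adj[a].append(b)
    adj[b].append(a)
print(rooted_dfs_dispersion(adj, root=0, k=3))  # three distinct nodes
```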
A kernelization for a parameterized decision problem $\mathcal{Q}$ is a polynomial-time preprocessing algorithm that reduces any parameterized instance $(x,k)$ to an instance $(x',k')$ whose size is bounded by a function of $k$ alone and which has the same yes/no answer for $\mathcal{Q}$. Such preprocessing algorithms cannot exist in the context of counting problems, when the answer to be preserved is the number of solutions, since this number can be arbitrarily large compared to $k$. However, we show that for counting minimum feedback vertex sets of size at most $k$, and for counting minimum dominating sets of size at most $k$ in a planar graph, there is a polynomial-time algorithm that either outputs the answer or reduces to an instance $(G',k')$ of size polynomial in $k$ with the same number of minimum solutions. This shows that a meaningful theory of kernelization for counting problems is possible and opens the door for future developments. Our algorithms exploit the fact that if the number of solutions exceeds $2^{\mathsf{poly}(k)}$, then the size of the input is exponential in $k$, so the running time of a parameterized counting algorithm can be bounded by $\mathsf{poly}(n)$. Otherwise, we can use gadgets that slightly increase $k$ to represent choices among $2^{O(k)}$ options by only $\mathsf{poly}(k)$ vertices.
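To spell out the counting argument with an illustrative exponent (the paper's actual polynomial may differ): a solution of size at most $k$ can be encoded as a $k$-tuple over the $n$ vertices plus a blank symbol, so
$$ \#\text{solutions} \;\le\; (n+1)^{k}, \qquad\text{hence}\qquad \#\text{solutions} > 2^{k^{2}} \;\Longrightarrow\; n+1 > 2^{k}, $$
in which case any $2^{O(k)}\cdot\mathsf{poly}(n)$-time parameterized counting algorithm already runs in time polynomial in $n$.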
In this paper, we prove the following non-linear generalization of the classical Sylvester--Gallai theorem. Let $\mathbb{K}$ be an algebraically closed field of characteristic $0$, and let $\mathcal{F}=\{F_1,\dots,F_m\} \subset \mathbb{K}[x_1,\dots,x_N]$ be a set of irreducible homogeneous polynomials of degree at most $d$ such that $F_i$ is not a scalar multiple of $F_j$ for $i\neq j$. Suppose that for any two distinct $F_i,F_j\in \mathcal{F}$, there is $k\neq i,j$ such that $F_k\in \mathrm{rad}(F_i,F_j)$. We prove that such radical SG configurations must be low dimensional. More precisely, we show that there exists a function $\lambda : \mathbb{N} \to \mathbb{N}$, independent of $\mathbb{K}$, $N$, and $m$, such that any such configuration $\mathcal{F}$ must satisfy $$ \dim (\mathrm{span}_{\mathbb{K}}{\mathcal{F}}) \leq \lambda(d). $$ Our result confirms a conjecture of Gupta [Gup14, Conjecture 2] and generalizes the quadratic and cubic Sylvester--Gallai theorems of [S20, OS22]. It takes us one step closer to the first deterministic polynomial-time algorithm for the Polynomial Identity Testing (PIT) problem for depth-4 circuits of bounded top and bottom fan-ins. Moreover, when combined with the Stillman uniformity type results of [AH20a, DLL19, ESS21], our result yields uniform bounds for several algebraic invariants such as projective dimension, Betti numbers, and Castelnuovo--Mumford regularity of ideals generated by radical SG configurations.
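For orientation, the degree-one case recovers the classical linear statement: for pairwise non-proportional linear forms, the radical condition is just linear span,
$$ d=1:\qquad F_k\in\mathrm{rad}(F_i,F_j) \iff F_k\in \mathrm{span}_{\mathbb{K}}\{F_i,F_j\}, $$
and Kelly's theorem (the complex Sylvester--Gallai theorem) then forces such a configuration to be low dimensional, so one may take $\lambda(1)$ to be a small constant ($\lambda(1)\le 3$, as stated in this line of work).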
We expound on some known lower bounds of the quadratic Wasserstein distance between random vectors in $\mathbb{R}^n$ with an emphasis on affine transformations that have been used in manifold learning of data in Wasserstein space. In particular, we give concrete lower bounds for rotated copies of random vectors in $\mathbb{R}^2$ with uncorrelated components by computing the Bures metric between the covariance matrices. We also derive upper bounds for compositions of affine maps which yield a fruitful variety of diffeomorphisms applied to an initial data measure. We apply these bounds to various distributions including those lying on a 1-dimensional manifold in $\mathbb{R}^2$ and illustrate the quality of the bounds. Finally, we give a framework for mimicking handwritten digit or alphabet datasets that can be applied in a manifold learning framework.
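A hedged numerical sketch of the covariance-based lower bound discussed above (the Gelbrich bound, $W_2^2(\mu,\nu)\ge \lVert m_1-m_2\rVert^2 + B(\Sigma_1,\Sigma_2)^2$, where $B$ is the Bures metric, with equality for Gaussians), applied to a rotated copy of a zero-mean random vector in $\mathbb{R}^2$ with uncorrelated components; the variable names are ours:

```python
import numpy as np
from scipy.linalg import sqrtm

# Bures distance between covariance matrices: a lower bound on W_2 between
# the corresponding (here zero-mean) distributions, tight for Gaussians.

def bures(S1, S2):
    root = sqrtm(S1)
    cross = sqrtm(root @ S2 @ root)
    val = np.trace(S1) + np.trace(S2) - 2.0 * np.trace(cross).real
    return np.sqrt(max(val, 0.0))  # clamp tiny negative round-off

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

S = np.diag([4.0, 1.0])      # uncorrelated components in R^2
R = rotation(np.pi / 4)
S_rot = R @ S @ R.T          # covariance of the rotated copy
print("W2 lower bound:", bures(S, S_rot))
```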
Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the "edge of stability" based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an "edge of stability" for Sharpness-Aware Minimization (SAM), a variant of GD that has been shown to improve generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
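For reference, the quadratic-approximation calculation behind the $2/\eta$ threshold for plain GD: on $L(w)=\frac{1}{2}w^{\top}Hw$ the GD map is linear,
$$ w_{t+1} \;=\; w_t - \eta\nabla L(w_t) \;=\; (I-\eta H)\,w_t, $$
and the iterates remain bounded exactly when every eigenvalue $a\ge 0$ of $H$ satisfies $|1-\eta a|\le 1$, i.e., $a\le 2/\eta$. The analogous calculation for SAM, whose ascent perturbation is normalized by the gradient norm, is what makes the resulting SAM-edge depend on the gradient norm.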
Explicit Runge--Kutta (\erk{}) methods are susceptible to a reduction in the observed order of convergence when applied to initial-boundary value problems with time-dependent boundary conditions. We study conditions on \erk{} methods that guarantee high-order convergence for linear problems; we refer to these conditions as weak stage order conditions. We prove a general relationship between the method's order, weak stage order, and number of stages. We derive \erk{} methods with high weak stage order and demonstrate, through numerical tests, that they avoid the order reduction phenomenon up to any order for linear problems and up to order three for nonlinear problems.
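For background (standard material, not specific to this paper): the classical stage order conditions of order $q$ read
$$ A\,c^{\,j-1} \;=\; \frac{c^{\,j}}{j}, \qquad j=1,\dots,q, $$
with powers of the abscissa vector $c$ taken componentwise. An explicit method, whose first stage is simply the current solution value, can satisfy these only up to $q=1$, which is precisely what motivates replacing them with the weaker conditions studied here.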
Foundation models, including Vision Language Models (VLMs) and Large Language Models (LLMs), possess the $generality$ to handle diverse distributions and tasks, which stems from their extensive pre-training datasets. Fine-tuning a foundation model is a common practice to enhance task performance or align the model's behavior with human expectations, allowing it to gain $speciality$. However, the small datasets used for fine-tuning may not adequately cover the diverse distributions and tasks encountered during pre-training. Consequently, pursuing speciality during fine-tuning can lead to a loss of $generality$ in the model, which is related to catastrophic forgetting (CF) in deep learning. In this study, we demonstrate this phenomenon in both VLMs and LLMs. For instance, fine-tuning VLMs like CLIP on ImageNet results in a loss of generality in handling diverse distributions, and fine-tuning LLMs like Galactica in the medical domain leads to a loss in instruction following and common sense. To address the trade-off between speciality and generality, we investigate multiple regularization methods from continual learning; the weight averaging method Wise-FT, from work on out-of-distribution (OOD) generalization, which interpolates parameters between the pre-trained and fine-tuned models; and parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA). Our findings show that both the continual learning and Wise-FT methods effectively mitigate the loss of generality, with Wise-FT exhibiting the strongest performance in balancing speciality and generality.
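For concreteness, a minimal sketch of the Wise-FT interpolation described above (weight-space averaging between the pre-trained and fine-tuned checkpoints); this assumes floating-point parameters and matching state-dict keys, and the helper name is ours:

```python
import torch

def wise_ft(pretrained_sd, finetuned_sd, alpha=0.5):
    """Interpolate two state dicts: alpha=0 keeps the pre-trained (general)
    weights, alpha=1 the fine-tuned (special) ones."""
    return {
        k: (1.0 - alpha) * pretrained_sd[k] + alpha * finetuned_sd[k]
        for k in pretrained_sd
    }

# Usage with any pair of identically shaped models:
# model.load_state_dict(wise_ft(pre.state_dict(), ft.state_dict(), alpha=0.5))
```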
We consider the following natural problem that generalizes min-sum-radii clustering: we are given $k\in\mathbb{N}$ as well as a metric space $(V,d)$, where $V=F\cup C$ for a set $F$ of facilities and a set $C$ of clients. The goal is to find a clustering given by $k$ facility-radius pairs $(f_1,r_1),\dots,(f_k,r_k)\in F\times\mathbb{R}_{\geq 0}$ such that $C\subseteq B(f_1,r_1)\cup\dots\cup B(f_k,r_k)$ and $\sum_{i=1}^{k} g(r_i)$ is minimized for some increasing function $g:\mathbb{R}_{\geq 0}\rightarrow\mathbb{R}_{\geq 0}$. Here, $B(x,r)$ denotes the radius-$r$ ball centered at $x$. For the case that $(V,d)$ is the shortest-path metric of an edge-weighted graph of bounded treewidth, we present a dynamic program that is tailored to this class of problems and achieves a polynomial running time, establishing that the problem is in $\mathsf{XP}$ parameterized by treewidth.
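To pin down the objective on tiny instances, here is a brute-force reference (ours; since $g$ is increasing, optimal radii can be assumed to equal some facility-client distance). The paper's contribution is the treewidth-based dynamic program, not this enumeration:

```python
from itertools import combinations, product

# Brute force over k facilities and candidate radii (facility-client
# distances, plus 0 for an unused ball), checking that every client is
# covered and minimizing sum g(r_i). Exponential; for illustration only.

def min_sum_g_radii(d, facilities, clients, k, g):
    best = float("inf")
    radii = sorted({d[f][c] for f in facilities for c in clients} | {0.0})
    for fs in combinations(facilities, k):
        for rs in product(radii, repeat=k):
            covered = all(any(d[f][c] <= r for f, r in zip(fs, rs))
                          for c in clients)
            if covered:
                best = min(best, sum(g(r) for r in rs))
    return best

# Example: 2 facilities, 3 clients, g(r) = r^2.
d = {0: {10: 1.0, 11: 2.0, 12: 5.0}, 1: {10: 4.0, 11: 3.0, 12: 1.0}}
print(min_sum_g_radii(d, facilities=[0, 1], clients=[10, 11, 12],
                      k=2, g=lambda r: r * r))  # 5.0
```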
In multi-turn dialog, utterances do not always take the full form of sentences \cite{Carbonell1983DiscoursePA}, which naturally makes understanding the dialog context more difficult. However, fully grasping the dialog context is essential for generating a reasonable response. Hence, in this paper, we propose to improve response generation performance by examining the model's ability to answer a reading comprehension question, where the question focuses on the information omitted from the dialog. Inspired by the multi-task learning scheme, we propose a joint framework that unifies these two tasks, sharing the same encoder to extract common, task-invariant features while using different decoders to learn task-specific features. To better fuse information from the question and the dialog history during encoding, we propose to augment the Transformer architecture with a memory updater, which is designed to selectively store and update the dialog history so as to support downstream tasks. For the experiments, we employ human annotators to write and verify a large-scale dialog reading comprehension dataset. Extensive experiments are conducted on this dataset, and the results show that the proposed model brings substantial improvements over several strong baselines on both tasks. In this way, we demonstrate that reasoning can indeed help produce better responses, and vice versa. We release our large-scale dataset for further research.
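Since the module's internals are not fixed here, the following is a hypothetical sketch of a gated memory updater in the spirit described: the memory attends to the new utterance encoding and is blended through a learned gate. The class name, dimensions, and gating form are all illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class MemoryUpdater(nn.Module):
    """Hypothetical gated memory update: read from the new utterance via
    attention, then blend old and new content through a sigmoid gate."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, memory, utterance):
        # memory: (B, M, D); utterance: (B, T, D)
        update, _ = self.attn(memory, utterance, utterance)   # read new info
        z = torch.sigmoid(self.gate(torch.cat([memory, update], dim=-1)))
        return z * memory + (1 - z) * update                  # gated blend

mem = torch.randn(2, 8, 64)
utt = torch.randn(2, 12, 64)
print(MemoryUpdater(64)(mem, utt).shape)  # torch.Size([2, 8, 64])
```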
While existing work in robust deep learning has focused on small pixel-level $\ell_p$ norm-based perturbations, these may not account for the perturbations encountered in several real-world settings. In many such cases, although test data might not be available, broad specifications about the types of perturbations (such as an unknown degree of rotation) may be known. We consider a setup where robustness is expected over an unseen test domain that is not i.i.d. but deviates from the training domain. While this deviation may not be known exactly, its broad characterization is specified a priori in terms of attributes. We propose an adversarial training approach that learns to generate new samples so as to maximize exposure of the classifier to the attribute space, without having access to data from the test domain. Our adversarial training solves a min-max optimization problem, with the inner maximization generating adversarial perturbations and the outer minimization finding model parameters that minimize the loss on these perturbations. We demonstrate the applicability of our approach on three types of naturally occurring perturbations -- object-related shifts, geometric transformations, and common image corruptions. Our approach enables deep neural networks to be robust against a wide range of naturally occurring perturbations, and we demonstrate its usefulness by showing the robustness gains of deep neural networks trained using our adversarial training on MNIST, CIFAR-10, and a new variant of the CLEVR dataset.
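A minimal sketch of the min-max loop for a single attribute (rotation): the inner loop ascends the loss over the angle, and the outer step descends on the adversarially rotated batch. The model, optimizer, and step sizes are placeholders, and the paper's learned-generator approach is more general than this one-parameter version:

```python
import torch
import torch.nn.functional as F

def rotate(x, theta):
    """Differentiably rotate a batch x of shape (B, C, H, W) by angle theta."""
    c, s = torch.cos(theta), torch.sin(theta)
    zero = torch.zeros_like(c)
    mat = torch.stack([torch.stack([c, -s, zero]),
                       torch.stack([s, c, zero])])      # (2, 3)
    mat = mat.unsqueeze(0).expand(x.size(0), -1, -1)    # (B, 2, 3)
    grid = F.affine_grid(mat, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def attribute_adv_step(model, opt, x, y, inner_steps=5, inner_lr=0.1):
    """One outer step: maximize the loss over the rotation angle (inner),
    then minimize the loss on the adversarially rotated batch (outer)."""
    theta = torch.zeros((), requires_grad=True)
    for _ in range(inner_steps):                        # inner maximization
        loss = F.cross_entropy(model(rotate(x, theta)), y)
        g = torch.autograd.grad(loss, theta)[0]
        theta = (theta + inner_lr * g).detach().requires_grad_(True)
    opt.zero_grad()                                     # outer minimization
    F.cross_entropy(model(rotate(x, theta.detach())), y).backward()
    opt.step()
```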