The bad science matrix problem consists of finding, among all matrices $A \in \mathbb{R}^{n \times n}$ whose rows have unit $\ell^2$ norm, one that maximizes $\beta(A) = \frac{1}{2^n} \sum_{x \in \{-1, 1\}^n} \|Ax\|_\infty$. Our main contribution is an explicit construction of an $n \times n$ matrix $A$ showing that $\beta(A) \geq \sqrt{\log_2(n+1)}$, which is only 18% smaller than the asymptotic rate. We prove that every entry of any optimal matrix is a square root of a rational number, and we find provably optimal matrices for $n \leq 4$.
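To make the objective concrete, here is a minimal brute-force evaluation of $\beta(A)$ in Python (a sketch for small $n$ only; it does not reproduce the paper's explicit construction):

```python
import itertools
import numpy as np

def beta(A):
    """Average of ||A x||_inf over all 2^n sign vectors x in {-1, 1}^n.
    Brute force, so feasible only for small n."""
    n = A.shape[1]
    total = 0.0
    for x in itertools.product((-1.0, 1.0), repeat=n):
        total += np.abs(A @ np.array(x)).max()
    return total / 2 ** n

# Rows must have unit l2 norm; the identity matrix attains beta = 1.
print(beta(np.eye(4)))  # 1.0
```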
We introduce the concept of an imprecise Markov semigroup $\mathbf{Q}$. It is a tool that allows us to represent ambiguity around both the initial and the transition probabilities of a Markov process via a compact collection of plausible Markov semigroups, each associated with a (different, plausible) Markov process. We use techniques from geometry, functional analysis, and (high-dimensional) probability to study the ergodic behavior of $\mathbf{Q}$. We show that, if the initial distribution of the Markov processes associated with the elements of $\mathbf{Q}$ is known and invariant, then under some conditions that also involve the geometry of the state space, the ambiguity around their transition probability eventually fades. We call this property ergodicity of the imprecise Markov semigroup, and we relate it to the classical notion of ergodicity. We prove ergodicity both when the state space is Euclidean or a Riemannian manifold, and when it is an arbitrary measurable space. The importance of our findings for the fields of machine learning and computer vision is also discussed.
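As a toy illustration of the fading-ambiguity phenomenon in the classical finite-state setting (our own example, not the paper's machinery), take two plausible transition matrices sharing the same invariant distribution; their $t$-step transition probabilities coalesce as $t$ grows:

```python
import numpy as np

# Two plausible transition matrices, both ergodic with the same
# invariant distribution pi = (0.5, 0.5).
P1 = np.array([[0.9, 0.1],
               [0.1, 0.9]])
P2 = np.array([[0.3, 0.7],
               [0.7, 0.3]])

for t in (1, 5, 20, 50):
    gap = np.abs(np.linalg.matrix_power(P1, t)
                 - np.linalg.matrix_power(P2, t)).max()
    print(t, gap)  # the gap between t-step transition probabilities shrinks
```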
An abstract topological graph (AT-graph) is a pair $A=(G,\mathcal{X})$, where $G=(V,E)$ is a graph and $\mathcal{X} \subseteq {E \choose 2}$ is a set of pairs of edges of $G$. A realization of $A$ is a drawing $\Gamma_A$ of $G$ in the plane such that any two edges $e_1,e_2$ of $G$ cross in $\Gamma_A$ if and only if $\{e_1,e_2\} \in \mathcal{X}$; $\Gamma_A$ is simple if any two edges intersect at most once (either at a common endpoint or at a proper crossing). The AT-graph Realizability (ATR) problem asks whether an input AT-graph admits a realization. The version of this problem that requires a simple realization is called Simple AT-graph Realizability (SATR). It is a classical result that both ATR and SATR are NP-complete. In this paper, we study the SATR problem from a new structural perspective. More precisely, we consider the size $\lambda(A)$ of the largest connected component of the crossing graph of any realization of $A$, i.e., the graph $\mathcal{C}(A) = (E, \mathcal{X})$. This parameter represents a natural way to measure the level of interplay among edge crossings. First, we prove that SATR is NP-complete when $\lambda(A) \geq 6$. On the positive side, we give an optimal linear-time algorithm that solves SATR when $\lambda(A) \leq 3$ and returns a simple realization if one exists. Our algorithm is based on several ingredients, in particular the reduction to a new embedding problem subject to constraints that require certain pairs of edges to alternate (in the rotation system), and a sequence of transformations that exploit the interplay between alternation constraints and the SPQR-tree and PQ-tree data structures to eventually arrive at a simpler embedding problem that can be solved with standard techniques.
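The crossing graph $\mathcal{C}(A)$ and the parameter $\lambda(A)$ are straightforward to compute; a small sketch using networkx (a library choice of ours, not of the paper):

```python
import networkx as nx

def largest_crossing_component(edges, crossing_pairs):
    """lambda(A): size of the largest connected component of the
    crossing graph C(A) = (E, X) of an AT-graph A = (G, X)."""
    C = nx.Graph()
    C.add_nodes_from(edges)           # one node per edge of G
    C.add_edges_from(crossing_pairs)  # one edge per prescribed crossing
    return max(len(c) for c in nx.connected_components(C))

E = [("a", "b"), ("c", "d"), ("e", "f")]
X = [(("a", "b"), ("c", "d"))]  # only edges ab and cd must cross
print(largest_crossing_component(E, X))  # 2
```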
A property $\Pi$ on a finite set $U$ is \emph{monotone} if for every $X \subseteq U$ satisfying $\Pi$, every superset $Y \subseteq U$ of $X$ also satisfies $\Pi$. Many combinatorial properties can be seen as monotone properties. The problem of finding a minimum subset of $U$ satisfying $\Pi$ is a central problem in combinatorial optimization. Although many approximate/exact algorithms have been developed to solve this kind of problem for numerous properties, a solution obtained by these algorithms is often unsuitable for real-world applications due to the difficulty of building accurate mathematical models of real-world problems. A promising approach to overcoming this difficulty is to \emph{enumerate} multiple small solutions rather than to \emph{find} a single small solution. To this end, given a weight function $w: U \to \mathbb N$ and an integer $k$, we devise algorithms that \emph{approximately} enumerate all minimal subsets of $U$ with weight at most $k$ satisfying $\Pi$ for various monotone properties $\Pi$, where "approximate enumeration" means that the algorithms output all minimal subsets satisfying $\Pi$ whose weight is at most $k$, and may also output some minimal subsets satisfying $\Pi$ whose weight exceeds $k$ but is at most $ck$ for some constant $c \ge 1$. These algorithms allow us to efficiently enumerate minimal vertex covers, minimal dominating sets in bounded degree graphs, minimal feedback vertex sets, minimal hitting sets in bounded rank hypergraphs, etc., of weight at most $k$ with constant approximation factors.
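For intuition, here is a brute-force exact enumerator of all inclusion-minimal subsets satisfying a monotone property, with vertex cover as the example property (tiny instances only; the paper's approximate enumeration algorithms are far more refined):

```python
from itertools import combinations

def minimal_sets(universe, satisfies):
    """All inclusion-minimal subsets of `universe` satisfying a monotone
    property, by increasing size; monotonicity guarantees that skipping
    supersets of previously found sets is sound."""
    universe = sorted(universe)
    found = []
    for r in range(len(universe) + 1):
        for S in map(frozenset, combinations(universe, r)):
            if satisfies(S) and not any(T < S for T in found):
                found.append(S)
    return found

# "S is a vertex cover" is monotone: every superset of a cover is a cover.
edges = [(1, 2), (2, 3), (3, 4)]
is_cover = lambda S: all(u in S or v in S for u, v in edges)
print([sorted(S) for S in minimal_sets({1, 2, 3, 4}, is_cover)])
# [[1, 3], [2, 3], [2, 4]]
```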
The generalization error (risk) of a supervised statistical learning algorithm quantifies its prediction ability on previously unseen data. Inspired by exponential tilting, Li et al. (2021) proposed the tilted empirical risk as a non-linear risk metric for machine learning applications such as classification and regression problems. In this work, we examine the generalization error of the tilted empirical risk. In particular, we provide uniform and information-theoretic bounds on the tilted generalization error, defined as the difference between the population risk and the tilted empirical risk, with a convergence rate of $O(1/\sqrt{n})$, where $n$ is the number of training samples. Furthermore, we study the solution to the KL-regularized expected tilted empirical risk minimization problem and derive an upper bound on the expected tilted generalization error with a convergence rate of $O(1/n)$.
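For reference, the $t$-tilted loss of Li et al. (2021) is $\frac{1}{t}\log\big(\frac{1}{n}\sum_{i=1}^n e^{t\,\ell_i}\big)$; a numerically stable sketch:

```python
import numpy as np
from scipy.special import logsumexp

def tilted_empirical_risk(losses, t):
    """(1/t) * log((1/n) * sum_i exp(t * loss_i)).
    Recovers the average loss as t -> 0 and the max loss as t -> +inf."""
    losses = np.asarray(losses, dtype=float)
    return (logsumexp(t * losses) - np.log(losses.size)) / t

losses = [0.1, 0.2, 2.0]
print(tilted_empirical_risk(losses, 1e-8))  # ~0.767 (the mean)
print(tilted_empirical_risk(losses, 50.0))  # ~1.98  (approaching the max)
```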
Given a matrix $\mathbf{A} \in \mathbb{R}^{k \times n}$, a partitioning of $[k]$ into groups $S_1,\dots,S_m$, an outer norm $p$, and a collection of inner norms such that either $p \ge 1$ and $p_1,\dots,p_m \ge 2$ or $p_1=\dots=p_m=p \ge 1/\log n$, we prove that there is a sparse weight vector $\boldsymbol{\beta} \in \mathbb{R}^{m}$ such that $\sum_{i=1}^m \boldsymbol{\beta}_i \cdot \|\mathbf{A}_{S_i}\mathbf{x}\|_{p_i}^p \approx_{1\pm\varepsilon} \sum_{i=1}^m \|\mathbf{A}_{S_i}\mathbf{x}\|_{p_i}^p$, where the number of nonzero entries of $\boldsymbol{\beta}$ is at most $O_{p,p_i}(\varepsilon^{-2}n^{\max(1,p/2)}(\log n)^2(\log(n/\varepsilon)))$. When $p_1,\dots,p_m \ge 2$, this weight vector arises from an importance sampling procedure based on the \textit{block Lewis weights}, a recently proposed generalization of Lewis weights. Additionally, we prove that there exist efficient algorithms to find the sparse weight vector $\boldsymbol{\beta}$ in several important regimes of $p$ and $p_1,\dots,p_m$. Our results imply an iteration complexity of $\widetilde{O}(\varepsilon^{-1}\sqrt{n})$ linear system solves for the problem of minimizing sums of Euclidean norms, improving over the previously known $\widetilde{O}(\sqrt{m}\log({1/\varepsilon}))$ iteration complexity when $m \gg n$. Our main technical contribution is a substantial generalization of the \textit{change-of-measure} method that Bourgain, Lindenstrauss, and Milman used to obtain the analogous result when every group has size $1$. Our generalization allows one to analyze change of measures beyond those implied by D. Lewis's original construction, including the measure implied by the block Lewis weights and natural approximations of this measure.
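A sketch of the quantity being preserved: evaluating $\sum_i \boldsymbol{\beta}_i\,\|\mathbf{A}_{S_i}\mathbf{x}\|_{p_i}^p$ for a given weight vector (computing the block Lewis weights themselves is the paper's contribution and is not reproduced here):

```python
import numpy as np

def block_objective(A, groups, x, p, ps, beta=None):
    """sum_i beta_i * ||A_{S_i} x||_{p_i}^p over groups S_1, ..., S_m.
    beta = None gives the exact objective; a sparse nonnegative beta, as
    produced by block-Lewis-weight sampling, should match it to 1 +/- eps
    simultaneously for every x."""
    beta = np.ones(len(groups)) if beta is None else beta
    return sum(b * np.linalg.norm(A[S] @ x, pi) ** p
               for b, S, pi in zip(beta, groups, ps))

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
groups = [[0, 1], [2, 3], [4, 5]]  # partition of the rows of A
x = rng.normal(size=3)
print(block_objective(A, groups, x, p=1, ps=[2, 2, 2]))  # sum of Euclidean norms
```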
We initiate a formal investigation into the design and analysis of LLM-based algorithms, i.e., algorithms that make one or more calls to large language models (LLMs) as sub-routines and critically rely on the capabilities of LLMs. While LLM-based algorithms, ranging from basic LLM calls with prompt engineering to complicated LLM-powered agent systems and compound AI systems, have achieved remarkable empirical success, their design and optimization have mostly relied on heuristics and trial and error, which is largely due to a lack of formal and analytical study of these algorithms. To fill this gap, we start by identifying the computational-graph representation of LLM-based algorithms, the design principle of task decomposition, and some key abstractions, which then facilitate our formal analysis of the accuracy and efficiency of LLM-based algorithms, despite the black-box nature of LLMs. Through extensive analytical and empirical investigation in a series of case studies, we demonstrate that the proposed framework is broadly applicable to a wide range of scenarios and diverse patterns of LLM-based algorithms, such as parallel, hierarchical, and recursive task decomposition. Our proposed framework holds promise for advancing LLM-based algorithms by revealing the reasons behind curious empirical phenomena, guiding the choices of hyperparameters, predicting the empirical performance of algorithms, and inspiring new algorithm design. To promote further study of LLM-based algorithms, we release our source code at //github.com/modelscope/agentscope/tree/main/examples/paper_llm_based_algorithm.
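As one concrete instance of the parallel task-decomposition pattern the framework covers (our own illustrative sketch with a placeholder `llm` call; not the paper's code):

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    """Placeholder for a single LLM call; swap in a real client."""
    return f"<answer to {prompt[:30]!r}>"

def summarize(document: str, chunk_size: int = 2000) -> str:
    """Parallel task decomposition: one LLM-call node per chunk in the
    computational graph, followed by an aggregation node."""
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda c: llm("Summarize: " + c), chunks))
    return llm("Combine these summaries: " + " ".join(partials))

print(summarize("some long document " * 500))
```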
We propose an exact algorithm for the Graph Burning Problem ($\texttt{GBP}$), an NP-hard optimization problem that models the spread of influence on social networks. Given a graph $G$ with vertex set $V$, the objective is to find a sequence of $k$ vertices in $V$, namely, $v_1, v_2, \dots, v_k$, such that $k$ is minimum and $\bigcup_{i = 1}^{k} \{u\! \in\! V\! : d(u, v_i) \leq k - i\} = V$, where $d(u,v)$ denotes the distance between $u$ and $v$. We formulate the problem as a set covering integer programming model and design a row generation algorithm for the $\texttt{GBP}$. Our method exploits the fact that a very small number of covering constraints is often sufficient for solving the integer model, allowing the corresponding rows to be generated on demand. To date, the most efficient exact algorithm for the $\texttt{GBP}$, denoted here by $\texttt{GDCA}$, is able to obtain optimal solutions for graphs with up to 14,000 vertices within two hours of execution. In comparison, our algorithm finds provably optimal solutions approximately 236 times faster, on average, than $\texttt{GDCA}$. For larger graphs, memory space becomes a limiting factor for $\texttt{GDCA}$. Our algorithm, however, solves real-world instances with almost 200,000 vertices in less than 35 seconds, increasing the size of graphs for which optimal solutions are known by a factor of 14.
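The feasibility condition is easy to check directly; a sketch that verifies a candidate burning sequence by truncated BFS (illustration only, not the paper's row generation algorithm):

```python
from collections import deque

def is_burning_sequence(adj, seq):
    """Check that every vertex u satisfies d(u, v_i) <= k - i for some i
    (1-indexed), where k = len(seq), via a truncated BFS per source."""
    k, burned = len(seq), set()
    for i, v in enumerate(seq, start=1):
        dist, frontier = {v: 0}, deque([v])
        while frontier:  # BFS ball of radius k - i around v_i
            u = frontier.popleft()
            if dist[u] < k - i:
                for w in adj[u]:
                    if w not in dist:
                        dist[w] = dist[u] + 1
                        frontier.append(w)
        burned.update(dist)
    return burned == set(adj)

# Path 1-2-3-4-5.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(is_burning_sequence(adj, [3, 1]))     # False: vertex 5 is never burned
print(is_burning_sequence(adj, [2, 4, 5]))  # True
```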
In a graph $G = (V,E)$, a $k$-ruling set $S$ is a vertex set such that every vertex of $V \setminus S$ is at distance at most $k$ from $S$. Finding a minimum $k$-ruling set is intrinsically linked to the minimum dominating set problem and the maximal independent set problem, which have been extensively studied in graph theory. This paper presents the first known algorithm for solving all $k$-ruling set problems in conjunction with known minimum dominating set algorithms, at only an additional polynomial time cost compared to computing a minimum dominating set. The algorithm further succeeds for $(\alpha, \alpha - 1)$-ruling sets with $\alpha > 1$, in which constraints also exist on the proximity of the vertices $v \in S$; this secondary application instead works in conjunction with maximal independent set algorithms.
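Checking whether a given set is a $k$-ruling set is a simple multi-source BFS (a sketch; the paper's contribution is constructing such sets, not verifying them):

```python
from collections import deque

def is_k_ruling(adj, S, k):
    """True iff every vertex of V \\ S is within distance k of S
    (multi-source BFS from S, truncated at depth k)."""
    dist = {v: 0 for v in S}
    frontier = deque(S)
    while frontier:
        u = frontier.popleft()
        if dist[u] < k:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    frontier.append(w)
    return set(adj) <= set(dist)

# Cycle on six vertices: {0, 3} is a 1-ruling (dominating) set; {0} is not.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(is_k_ruling(adj, {0, 3}, 1))  # True
print(is_k_ruling(adj, {0}, 1))     # False: vertex 3 is at distance 3
```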
In-context learning (ICL) refers to a remarkable capability of pretrained large language models, which can learn a new task given a few examples during inference. However, the theoretical understanding of ICL is largely under-explored, particularly whether transformers can be trained to generalize to unseen examples in a prompt, which requires the model to acquire contextual knowledge of the prompt for generalization. This paper investigates the training dynamics of transformers trained by gradient descent through the lens of non-linear regression tasks. Contextual generalization here can be attained by learning the template function for each task in-context, where all template functions lie in a linear space spanned by $m$ basis functions. We analyze the training dynamics of one-layer multi-head transformers that predict unlabeled inputs in-context given partially labeled prompts, where the labels contain Gaussian noise and the number of examples in each prompt is not sufficient to determine the template. Under mild assumptions, we show that the training loss for a one-layer multi-head transformer converges linearly to a global minimum. Moreover, the transformer effectively learns to perform ridge regression over the basis functions. To our knowledge, this study is the first provable demonstration that transformers can learn contextual (i.e., template) information to generalize to both unseen examples and tasks when prompts contain only a small number of query-answer pairs.
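A toy version of the predictor the transformer is shown to implement: ridge regression over $m$ basis functions fitted to a handful of noisy prompt labels (the monomial basis and the regularization strength are our illustrative choices):

```python
import numpy as np

def basis(x, m=4):
    """m basis functions; monomials are an illustrative choice."""
    return np.stack([x ** j for j in range(m)], axis=-1)

rng = np.random.default_rng(0)
w_star = rng.normal(size=4)                 # defines the task's template
x_lab = rng.normal(size=3)                  # fewer examples than basis functions
y_lab = basis(x_lab) @ w_star + 0.1 * rng.normal(size=3)  # noisy labels

lam = 0.1                                   # illustrative ridge strength
Phi = basis(x_lab)
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ y_lab)

x_query = rng.normal(size=2)                # unlabeled inputs in the prompt
print(basis(x_query) @ w_hat)               # in-context predictions
```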
In the solution discovery variant of a vertex (edge) subset problem $\Pi$ on graphs, we are given an initial configuration of tokens on the vertices (edges) of an input graph $G$ together with a budget $b$. The question is whether we can transform this configuration into a feasible solution of $\Pi$ on $G$ with at most $b$ modification steps. We consider the token sliding variant of the solution discovery framework, where each modification step consists of sliding a token to an adjacent vertex (edge). The framework of solution discovery was recently introduced by Fellows et al. [Fellows et al., ECAI 2023], and for many solution discovery problems both the classical and the parameterized complexity have been established. In this work, we study the kernelization complexity of the solution discovery variants of Vertex Cover, Independent Set, Dominating Set, Shortest Path, Matching, and Vertex Cut with respect to the parameters number of tokens $k$ and discovery budget $b$, as well as structural parameters such as pathwidth.
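For small instances, the token sliding question can be answered by BFS over configurations; a sketch with Vertex Cover Discovery as the target property (our illustration, not the paper's method):

```python
from collections import deque

def discovery_distance(adj, start, is_feasible):
    """BFS over token configurations: minimum number of single-token
    slides from `start` to a feasible solution (None if unreachable).
    Exponential state space, so tiny instances only; slides onto
    occupied vertices are disallowed in this sketch."""
    start = tuple(sorted(start))
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        conf, d = queue.popleft()
        if is_feasible(set(conf)):
            return d
        for i, v in enumerate(conf):
            for w in adj[v]:  # slide the i-th token from v to w
                if w in conf:
                    continue
                nxt = tuple(sorted(conf[:i] + (w,) + conf[i + 1:]))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return None

# Vertex Cover Discovery on the path 1-2-3-4, tokens starting on {1, 4}.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
edges = [(1, 2), (2, 3), (3, 4)]
is_vc = lambda S: all(u in S or v in S for u, v in edges)
print(discovery_distance(adj, (1, 4), is_vc))  # 1: slide the token on 1 to 2
```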