Multiple TSP ($\mathrm{mTSP}$) is an important variant of $\mathrm{TSP}$ where a set of $k$ salespersons together visit a set of $n$ cities. The $\mathrm{mTSP}$ problem has many real-life applications, such as vehicle routing. Rothkopf introduced another variant of $\mathrm{TSP}$ called many-visits TSP ($\mathrm{MV\mbox{-}TSP}$) where a request $r(v)\in \mathbb{Z}_+$ is given for each city $v$ and a single salesperson needs to visit each city $r(v)$ times and return to the starting point. A combination of $\mathrm{mTSP}$ and $\mathrm{MV\mbox{-}TSP}$ called many-visits multiple TSP $(\mathrm{MV\mbox{-}mTSP})$ was studied by B\'erczi, Mnich, and Vincze, who give approximation algorithms for various variants of $\mathrm{MV\mbox{-}mTSP}$. In this work, we show a simple linear programming (LP) based reduction that converts an $\mathrm{mTSP}$ LP-based algorithm to an LP-based algorithm for $\mathrm{MV\mbox{-}mTSP}$ with the same approximation factor. We apply this reduction to improve or match the current best approximation factors of several variants of $\mathrm{MV\mbox{-}mTSP}$. Our reduction shows that the addition of visit requests $r(v)$ to $\mathrm{mTSP}$ does $\textit{not}$ make the problem harder to approximate even when $r(v)$ is exponential in the number of vertices. To apply our reduction, we either use existing LP-based algorithms for $\mathrm{mTSP}$ variants or show that several existing combinatorial algorithms for $\mathrm{mTSP}$ variants can be interpreted as LP-based algorithms. This allows us to apply our reduction to these combinatorial algorithms as well, achieving the improved guarantees.
Subset Sum Ratio is the following optimization problem: Given a set of $n$ positive numbers $I$, find disjoint subsets $X,Y \subseteq I$ minimizing the ratio $\max\{\Sigma(X)/\Sigma(Y),\Sigma(Y)/\Sigma(X)\}$, where $\Sigma(Z)$ denotes the sum of all elements of $Z$. Subset Sum Ratio is an optimization variant of the Equal Subset Sum problem. It was introduced by Woeginger and Yu in '92 and is known to admit an FPTAS [Bazgan, Santha, Tuza '98]. The best approximation schemes before this work had running time $O(n^4/\varepsilon)$ [Melissinos, Pagourtzis '18], $\tilde O(n^{2.3}/\varepsilon^{2.6})$ and $\tilde O(n^2/\varepsilon^3)$ [Alonistiotis et al. '22]. In this work, we present an improved approximation scheme for Subset Sum Ratio running in time $O(n / \varepsilon^{0.9386})$. Here we assume that the items are given in sorted order, otherwise we need an additional running time of $O(n \log n)$ for sorting. Our improved running time simultaneously improves the dependence on $n$ to linear and the dependence on $1/\varepsilon$ to sublinear. For comparison, the related Subset Sum problem admits an approximation scheme running in time $O(n/\varepsilon)$ [Gens, Levner '79]. If one would achieve an approximation scheme with running time $\tilde O(n / \varepsilon^{0.99})$ for Subset Sum, then one would falsify the Strong Exponential Time Hypothesis [Abboud, Bringmann, Hermelin, Shabtay '19] as well as the Min-Plus-Convolution Hypothesis [Bringmann, Nakos '21]. We thus establish that Subset Sum Ratio admits faster approximation schemes than Subset Sum. This comes as a surprise, since at any point in time before this work the best known approximation scheme for Subset Sum Ratio had a worse running time than the best known approximation scheme for Subset Sum.
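To make the objective concrete, the following is a minimal brute-force sketch of the Subset Sum Ratio definition above. It is not the approximation scheme from the abstract: it enumerates all $3^n$ assignments and is only meant for tiny inputs. The function name `subset_sum_ratio_bruteforce` is hypothetical, and the sketch assumes positive integer items and non-empty $X$ and $Y$ (so that the ratio is well defined).

```python
from fractions import Fraction
from itertools import product


def subset_sum_ratio_bruteforce(items):
    """Return (ratio, X, Y) minimizing max(S(X)/S(Y), S(Y)/S(X)).

    Illustration only: tries every way of placing each item in X, in Y,
    or in neither, which takes 3^n steps. Assumes positive integers.
    """
    best = None
    # Label 0 = unused, 1 = item goes to X, 2 = item goes to Y.
    for labels in product((0, 1, 2), repeat=len(items)):
        X = [a for a, lab in zip(items, labels) if lab == 1]
        Y = [a for a, lab in zip(items, labels) if lab == 2]
        if not X or not Y:
            continue  # assumption: both subsets must be non-empty
        sx, sy = sum(X), sum(Y)
        # max(sx/sy, sy/sx) as an exact rational number.
        ratio = Fraction(max(sx, sy), min(sx, sy))
        if best is None or ratio < best[0]:
            best = (ratio, X, Y)
    return best


if __name__ == "__main__":
    print(subset_sum_ratio_bruteforce([1, 2, 4]))
```

A ratio of exactly 1 corresponds to a solution of the Equal Subset Sum problem mentioned above.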
Uniform sampling from the set $\mathcal{G}(\mathbf{d})$ of graphs with a given degree-sequence $\mathbf{d} = (d_1, \dots, d_n) \in \mathbb N^n$ is a classical problem in the study of random graphs. We consider an analogue for temporal graphs in which the edges are labeled with integer timestamps. The input to this generation problem is a tuple $\mathbf{D} = (\mathbf{d}, T) \in \mathbb N^n \times \mathbb N_{>0}$ and the task is to output a uniform random sample from the set $\mathcal{G}(\mathbf{D})$ of temporal graphs with degree-sequence $\mathbf{d}$ and timestamps in the interval $[1, T]$. By allowing repeated edges with distinct timestamps, $\mathcal{G}(\mathbf{D})$ can be non-empty even if $\mathcal{G}(\mathbf{d})$ is empty, and as a consequence, existing algorithms are difficult to apply. We describe an algorithm for this generation problem which runs in expected time $O(M)$ if $\Delta^{2+\epsilon} = O(M)$ for some constant $\epsilon > 0$ and $T - \Delta = \Omega(T)$, where $M = \sum_i d_i$ and $\Delta = \max_i d_i$. Our algorithm applies the switching method of McKay and Wormald $[1]$ to temporal graphs: we first generate a random temporal multigraph and then remove self-loops and duplicated edges with switching operations which rewire the edges in a degree-preserving manner.
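As a small illustration of the set $\mathcal{G}(\mathbf{D})$ described above, the sketch below checks whether a list of timestamped edges is a member. It is only a membership test, not the sampling algorithm from the abstract. The function name `is_valid_temporal_graph` and the encoding of a temporal edge as a triple `(u, v, t)` with nodes numbered from 0 are assumptions made for this example.

```python
from collections import Counter


def is_valid_temporal_graph(edges, d, T):
    """Check membership in G(D) for D = (d, T), under the assumed encoding.

    edges: list of (u, v, t) triples with nodes in 0..n-1 (hypothetical format).
    Valid means: no self-loops, timestamps in [1, T], no edge repeated with
    the same timestamp, and node i is incident to exactly d[i] temporal edges.
    """
    n = len(d)
    seen = set()
    degree = Counter()
    for u, v, t in edges:
        if not (0 <= u < n and 0 <= v < n):
            return False
        if u == v:  # self-loop
            return False
        if not (1 <= t <= T):  # timestamp outside [1, T]
            return False
        key = (min(u, v), max(u, v), t)  # undirected edge with its timestamp
        if key in seen:  # same edge at the same time is a duplicate
            return False
        seen.add(key)
        degree[u] += 1
        degree[v] += 1
    return all(degree[i] == d[i] for i in range(n))
```

For example, with $\mathbf{d} = (2, 2)$ no simple graph on two nodes exists, yet with $T = 2$ the edge list `[(0, 1, 1), (0, 1, 2)]` passes the check, which matches the remark that $\mathcal{G}(\mathbf{D})$ can be non-empty when $\mathcal{G}(\mathbf{d})$ is empty.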
For each of $T$ time steps, $m$ experts report probability distributions over $n$ outcomes; we wish to learn to aggregate these forecasts in a way that attains a no-regret guarantee. We focus on the fundamental and practical aggregation method known as logarithmic pooling -- a weighted average of log odds -- which is in a certain sense the optimal choice of pooling method if one is interested in minimizing log loss (as we take to be our loss function). We consider the problem of learning the best set of parameters (i.e. expert weights) in an online adversarial setting. We assume (by necessity) that the adversarial choices of outcomes and forecasts are consistent, in the sense that experts report calibrated forecasts. Imposing this constraint creates a (to our knowledge) novel semi-adversarial setting in which the adversary retains a large amount of flexibility. In this setting, we present an algorithm based on online mirror descent that learns expert weights in a way that attains $O(\sqrt{T} \log T)$ expected regret as compared with the best weights in hindsight.
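The pooling rule and loss named above can be sketched as follows. This is an illustration of logarithmic pooling in its standard form for $n$ outcomes (a normalized weighted geometric mean of the expert forecasts), not the online mirror descent algorithm from the abstract. The function names `log_pool` and `log_loss` are hypothetical, and the sketch assumes strictly positive forecast probabilities and non-negative weights summing to 1.

```python
import math


def log_pool(forecasts, weights):
    """Logarithmic pool: p(j) proportional to prod_i p_i(j) ** w_i.

    forecasts: list of m probability vectors over n outcomes (all entries > 0).
    weights:   list of m non-negative expert weights (assumed to sum to 1).
    """
    n = len(forecasts[0])
    # Weighted sum of log-probabilities for each outcome.
    scores = [
        sum(w * math.log(p[j]) for p, w in zip(forecasts, weights))
        for j in range(n)
    ]
    shift = max(scores)  # subtract the max for numerical stability
    unnormalized = [math.exp(s - shift) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def log_loss(pooled, outcome):
    """Log loss of the pooled forecast on the realized outcome index."""
    return -math.log(pooled[outcome])


if __name__ == "__main__":
    pooled = log_pool([[0.5, 0.5], [0.8, 0.2]], [0.5, 0.5])
    print(pooled, log_loss(pooled, 0))
```

With two outcomes this reduces to a weighted average of the experts' log odds, which is the description used in the abstract.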
In generative compressed sensing (GCS), we want to recover a signal $\mathbf{x}^* \in \mathbb{R}^n$ from $m$ measurements ($m\ll n$) using a generative prior $\mathbf{x}^*\in G(\mathbb{B}_2^k(r))$, where $G$ is typically an $L$-Lipschitz continuous generative model and $\mathbb{B}_2^k(r)$ represents the radius-$r$ $\ell_2$-ball in $\mathbb{R}^k$. Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously. In this paper, we build a unified framework to derive uniform recovery guarantees for nonlinear GCS where the observation model is nonlinear and possibly discontinuous or unknown. Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples. Specifically, using a single realization of the sensing ensemble and generalized Lasso, {\em all} $\mathbf{x}^*\in G(\mathbb{B}_2^k(r))$ can be recovered up to an $\ell_2$-error at most $\epsilon$ using roughly $\tilde{O}({k}/{\epsilon^2})$ samples, with omitted logarithmic factors typically being dominated by $\log L$. Notably, this almost coincides with existing non-uniform guarantees up to logarithmic factors, hence the uniformity costs very little. As part of our technical contributions, we introduce the Lipschitz approximation to handle discontinuous observation models. We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy. Experimental results are presented to corroborate our theory.
Consider the {$\ell_{\alpha}$} regularized linear regression, also termed Bridge regression. For $\alpha\in (0,1)$, Bridge regression enjoys several statistical properties of interest such as sparsity and near-unbiasedness of the estimates (Fan and Li, 2001). However, the main difficulty lies in the non-convex nature of the penalty for these values of $\alpha$, which makes an optimization procedure challenging and usually it is only possible to find a local optimum. To address this issue, Polson et al. (2013) took a sampling based fully Bayesian approach to this problem, using the correspondence between the Bridge penalty and a power exponential prior on the regression coefficients. However, their sampling procedure relies on Markov chain Monte Carlo (MCMC) techniques, which are inherently sequential and not scalable to large problem dimensions. Cross validation approaches are similarly computation-intensive. To this end, our contribution is a novel \emph{non-iterative} method to fit a Bridge regression model. The main contribution lies in an explicit formula for Stein's unbiased risk estimate for the out of sample prediction risk of Bridge regression, which can then be optimized to select the desired tuning parameters, allowing us to completely bypass MCMC as well as computation-intensive cross validation approaches. Our procedure yields results in a fraction of computational times compared to iterative schemes, without any appreciable loss in statistical performance. An R implementation is publicly available online at: //github.com/loriaJ/Sure-tuned_BridgeRegression .
This paper studies the prediction of a target $\mathbf{z}$ from a pair of random variables $(\mathbf{x},\mathbf{y})$, where the ground-truth predictor is additive $\mathbb{E}[\mathbf{z} \mid \mathbf{x},\mathbf{y}] = f_\star(\mathbf{x}) +g_{\star}(\mathbf{y})$. We study the performance of empirical risk minimization (ERM) over functions $f+g$, $f \in F$ and $g \in G$, fit on a given training distribution, but evaluated on a test distribution which exhibits covariate shift. We show that, when the class $F$ is "simpler" than $G$ (measured, e.g., in terms of its metric entropy), our predictor is more resilient to $\textbf{heterogeneous covariate shifts}$ in which the shift in $\mathbf{x}$ is much greater than that in $\mathbf{y}$. Our analysis proceeds by demonstrating that ERM behaves $\textbf{qualitatively similarly to orthogonal machine learning}$: the rate at which ERM recovers the $f$-component of the predictor has only a lower-order dependence on the complexity of the class $G$, adjusted for partial non-identifiability introduced by the additive structure. These results rely on a novel H\"older style inequality for the Dudley integral which may be of independent interest. Moreover, we corroborate our theoretical findings with experiments demonstrating improved resilience to shifts in "simpler" features across numerous domains.
Let $\Gamma$ be a simple connected graph on $n$ vertices, and let $C$ be a code of length $n$ whose coordinates are indexed by the vertices of $\Gamma$. We say that $C$ is a \textit{storage code} on $\Gamma$ if for any codeword $c \in C$, one can recover the information on each coordinate of $c$ by accessing its neighbors in $\Gamma$. The main problem here is to construct high-rate storage codes on triangle-free graphs. In this paper, we solve an open problem posed by Barg and Z\'emor in 2022, showing that the BCH family of storage codes is of unit rate. Furthermore, we generalize the construction of the BCH family and obtain more storage codes of unit rate on triangle-free graphs.
Consider that there are $k\le n$ agents in a simple, connected, and undirected graph $G=(V,E)$ with $n$ nodes and $m$ edges. The goal of the dispersion problem is to move these $k$ agents to distinct nodes. Agents can communicate only when they are at the same node, and no other means of communication such as whiteboards are available. We assume that the agents operate synchronously. We consider two scenarios: when all agents are initially located at any single node (rooted setting) and when they are initially distributed over any one or more nodes (general setting). Kshemkalyani and Sharma presented a dispersion algorithm for the general setting, which uses $O(m_k)$ time and $\log(k+\delta)$ bits of memory per agent [OPODIS 2021]. Here, $m_k$ is the maximum number of edges in any induced subgraph of $G$ with $k$ nodes, and $\delta$ is the maximum degree of $G$. This algorithm is the fastest in the literature, as no algorithm with $o(m_k)$ time has been discovered even for the rooted setting. In this paper, we present faster algorithms for both the rooted and general settings. First, we present an algorithm for the rooted setting that solves the dispersion problem in $O(k\log \min(k,\delta))=O(k\log k)$ time using $O(\log \delta)$ bits of memory per agent. Next, we propose an algorithm for the general setting that achieves dispersion in $O(k (\log k)\cdot (\log \min(k,\delta)))=O(k \log^2 k)$ time using $O(\log (k+\delta))$ bits.
In the classical transformer attention scheme, we are given three $n \times d$ size matrices $Q, K, V$ (the query, key, and value tokens), and the goal is to compute a new $n \times d$ size matrix $D^{-1} \exp(QK^\top) V$ where $D = \mathrm{diag}( \exp(QK^\top) {\bf 1}_n )$. In this work, we study a generalization of attention which captures triple-wise correlations. This generalization is able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers. The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in $n$. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that bounded entries are both necessary and sufficient for quickly performing generalized computations: $\bullet$ On the positive side, if all entries of the input matrices are bounded above by $o(\sqrt[3]{\log n})$ then we show how to approximate the ``tensor-type'' attention matrix in $n^{1+o(1)}$ time. $\bullet$ On the negative side, we show that if the entries of the input matrices may be as large as $\Omega(\sqrt[3]{\log n})$, then there is no algorithm that runs faster than $n^{3-o(1)}$ (assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory). We also show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations. Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm. Our results thus yield a natural tradeoff between the boundedness of the entries, and order of the tensor one may use for more expressive, efficient attention computation.
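For reference, the classical attention matrix $D^{-1} \exp(QK^\top) V$ defined above can be computed directly as in the sketch below. This is the straightforward quadratic-time evaluation of the formula, with $\exp$ applied entrywise; it is neither the tensor generalization nor the near-linear time algorithm from the abstract. The function name `attention` and the list-of-lists matrix encoding are assumptions made for this example.

```python
import math


def attention(Q, K, V):
    """Compute D^{-1} exp(Q K^T) V with D = diag(exp(Q K^T) 1_n).

    Q, K, V: n x d matrices given as lists of rows (assumed encoding).
    Runs in O(n^2 d) time; exp is applied entrywise.
    """
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # Row i of exp(Q K^T): exponentials of the inner products <Q_i, K_j>.
        row = [
            math.exp(sum(Q[i][t] * K[j][t] for t in range(d)))
            for j in range(n)
        ]
        denom = sum(row)  # the diagonal entry D_ii
        # Row i of the output: normalized weighted combination of rows of V.
        out.append(
            [sum(row[j] * V[j][c] for j in range(n)) / denom for c in range(d)]
        )
    return out


if __name__ == "__main__":
    Q = [[0.0, 0.0], [0.0, 0.0]]
    K = [[1.0, 2.0], [3.0, 4.0]]
    V = [[1.0, 0.0], [3.0, 2.0]]
    print(attention(Q, K, V))
```

When every row of $Q$ is zero, all entries of $\exp(QK^\top)$ equal 1, so each output row is the average of the rows of $V$, which gives a quick sanity check of the normalization by $D^{-1}$.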
In this paper, we prove the following non-linear generalization of the classical Sylvester-Gallai theorem. Let $\mathbb{K}$ be an algebraically closed field of characteristic $0$, and $\mathcal{F}=\{F_1,\cdots,F_m\} \subset \mathbb{K}[x_1,\cdots,x_N]$ be a set of irreducible homogeneous polynomials of degree at most $d$ such that $F_i$ is not a scalar multiple of $F_j$ for $i\neq j$. Suppose that for any two distinct $F_i,F_j\in \mathcal{F}$, there is $k\neq i,j$ such that $F_k\in \mathrm{rad}(F_i,F_j)$. We prove that such radical SG configurations must be low dimensional. More precisely, we show that there exists a function $\lambda : \mathbb{N} \to \mathbb{N}$, independent of $\mathbb{K},N$ and $m$, such that any such configuration $\mathcal{F}$ must satisfy $$ \dim (\mathrm{span}_{\mathbb{K}}{\mathcal{F}}) \leq \lambda(d). $$ Our result confirms a conjecture of Gupta [Gup14, Conjecture 2] and generalizes the quadratic and cubic Sylvester-Gallai theorems of [S20,OS22]. Our result takes us one step closer towards the first deterministic polynomial time algorithm for the Polynomial Identity Testing (PIT) problem for depth-4 circuits of bounded top and bottom fanins. Our result, when combined with the Stillman uniformity type results of [AH20a,DLL19,ESS21], yields uniform bounds for several algebraic invariants such as projective dimension, Betti numbers and Castelnuovo-Mumford regularity of ideals generated by radical SG configurations.