We study Leaky ResNets, which interpolate between ResNets ($\tilde{L}=0$) and Fully-Connected nets ($\tilde{L}\to\infty$) depending on an 'effective depth' hyper-parameter $\tilde{L}$. In the infinite depth limit, we study 'representation geodesics' $A_{p}$: continuous paths in representation space (similar to NeuralODEs) from input $p=0$ to output $p=1$ that minimize the parameter norm of the network. We give a Lagrangian and Hamiltonian reformulation, which highlights the importance of two terms: a kinetic energy that favors small layer derivatives $\partial_{p}A_{p}$ and a potential energy that favors low-dimensional representations, as measured by the 'Cost of Identity'. The balance between these two forces offers an intuitive understanding of feature learning in ResNets. We leverage this intuition to explain the emergence of a bottleneck structure, as observed in previous work: for large $\tilde{L}$ the potential energy dominates and leads to a separation of timescales, where the representation jumps rapidly from the high-dimensional inputs to a low-dimensional representation, moves slowly inside the space of low-dimensional representations, and then jumps back to the potentially high-dimensional outputs. Inspired by this phenomenon, we train with an adaptive layer step-size that adapts to this separation of timescales.
Since Harrow, Hassidim, and Lloyd (2009) showed that a system of linear equations with $N$ variables and condition number $\kappa$ can be solved on a quantum computer in $\operatorname{poly}(\log(N), \kappa)$ time, exponentially faster than any classical algorithm, improvements and applications of this algorithm have been extensively investigated. The state-of-the-art quantum algorithm for this problem is due to Costa, An, Sanders, Su, Babbush, and Berry (2022), with optimal query complexity $\Theta(\kappa)$. An important remaining question is whether parallelism can bring further optimization. In this paper, we study the limitations of parallel quantum computing on this problem. We show that any quantum algorithm for solving systems of linear equations with time complexity $\operatorname{poly}(\log(N), \kappa)$ has a lower bound of $\Omega(\kappa)$ on the depth of queries, which is tight up to a constant factor.
We prove that for any integers $\alpha, \beta > 1$, the existential fragment of the first-order theory of the structure $\langle \mathbb{Z}; 0,1,<, +, \alpha^{\mathbb{N}}, \beta^{\mathbb{N}}\rangle$ is decidable (where $\alpha^{\mathbb{N}}$ is the set of positive integer powers of $\alpha$, and likewise for $\beta^{\mathbb{N}}$). On the other hand, we show by way of hardness that decidability of the existential fragment of the theory of $\langle \mathbb{N}; 0,1, <, +, x\mapsto \alpha^x, x \mapsto \beta^x\rangle$ for any multiplicatively independent $\alpha,\beta > 1$ would lead to mathematical breakthroughs regarding base-$\alpha$ and base-$\beta$ expansions of certain transcendental numbers.
We study the algorithmic task of finding large independent sets in Erd\H{o}s--R\'enyi $r$-uniform hypergraphs on $n$ vertices having average degree $d$. Krivelevich and Sudakov showed that the maximum independent set has density $\left(\frac{r\log d}{(r-1)d}\right)^{1/(r-1)}$ asymptotically. We show that the class of low-degree polynomial algorithms can find independent sets of density $\left(\frac{\log d}{(r-1)d}\right)^{1/(r-1)}$ but no larger. This extends and generalizes earlier results of Gamarnik and Sudan, Rahman and Vir\'ag, and Wein on graphs, and answers a question of Bal and Bennett. We conjecture that this statistical-computational gap is inherent to the problem. Additionally, we explore the universality of this gap by examining $r$-partite hypergraphs. A hypergraph $H=(V,E)$ is $r$-partite if there is a partition $V=V_1\cup\cdots\cup V_r$ such that each edge contains exactly one vertex from each set $V_i$. We consider the problem of finding large balanced independent sets (independent sets containing the same number of vertices in each partition) in random $r$-partite hypergraphs with $n$ vertices in each partition and average degree $d$. We prove that the maximum balanced independent set has density $\left(\frac{r\log d}{(r-1)d}\right)^{1/(r-1)}$ asymptotically. Furthermore, we prove an analogous low-degree computational threshold of $\left(\frac{\log d}{(r-1)d}\right)^{1/(r-1)}$. Our results recover and generalize recent work of Perkins and the second author on bipartite graphs. While the graph case has been extensively studied, this work is the first to consider statistical-computational gaps of optimization problems on random hypergraphs. Our results suggest that these gaps persist for larger uniformities as well as across many models. A somewhat surprising aspect of the gap for balanced independent sets is that the algorithm achieving the lower bound is a simple degree-1 polynomial.
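To make the central object concrete, here is a small illustrative Python sketch (not from the paper; the function names and the example hypergraph are ours): an independent set in an $r$-uniform hypergraph is a vertex set containing no edge entirely, and on tiny instances the maximum can be found by exhaustive search, which is what the density results above quantify at scale.

```python
from itertools import combinations

def is_independent(edges, S):
    """True if the vertex set S contains no hyperedge entirely."""
    S = set(S)
    return all(not set(e) <= S for e in edges)

def max_independent_set_size(n, edges):
    """Brute-force maximum independent set in a hypergraph on vertices 0..n-1."""
    for size in range(n, -1, -1):
        for S in combinations(range(n), size):
            if is_independent(edges, S):
                return size
    return 0

# A tiny 3-uniform hypergraph on 5 vertices with two edges.
edges = [(0, 1, 2), (2, 3, 4)]
print(max_independent_set_size(5, edges))  # 4: dropping vertex 2 breaks both edges
```

Exhaustive search is of course exponential; the abstract concerns what density polynomial-time (low-degree) algorithms can reach on random instances.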
This paper deals with sufficient conditions on the distribution of the random variable $H$, in the model $X =\Pi_C(H)$, for the convex hull $\widehat C_N$ of $N$ independent copies of $X$ to be a consistent estimator of the convex body $C$ with a rate of convergence. The convergence of $\widehat C_N$ is established for the Hausdorff distance under a uniform condition on the distribution of $H$, but also in a pointwise sense under a less demanding condition. Some of these convergence results on $\widehat C_N$ are applied to the estimation of the time-dependent constraint set involved in a discrete-time Skorokhod problem.
For an integer $b\ge 0$, a $b$-matching in a graph $G=(V,E)$ is a set $S\subseteq E$ such that each vertex $v\in V$ is incident to at most $b$ edges in $S$. We design a fully polynomial-time approximation scheme (FPTAS) for counting the number of $b$-matchings in graphs with bounded degrees. Our FPTAS also applies to a broader family of counting problems, namely Holant problems with log-concave signatures. Our algorithm is based on Moitra's linear programming approach (JACM'19). Using a novel construction called the extended coupling tree, we derandomize the coupling designed by Chen and Gu (SODA'24).
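The FPTAS itself is beyond a short snippet, but the quantity it approximates is easy to state exactly on tiny graphs. The following brute-force Python sketch (illustrative only; names are ours) enumerates edge subsets and checks the degree constraint defining a $b$-matching.

```python
from itertools import combinations

def count_b_matchings(vertices, edges, b):
    """Count edge subsets S such that every vertex meets at most b edges of S."""
    count = 0
    for size in range(len(edges) + 1):
        for S in combinations(edges, size):
            deg = {v: 0 for v in vertices}
            for (u, w) in S:
                deg[u] += 1
                deg[w] += 1
            if all(d <= b for d in deg.values()):
                count += 1
    return count

# Triangle graph: b=1 recovers ordinary matchings (including the empty set).
tri = [(0, 1), (1, 2), (0, 2)]
print(count_b_matchings(range(3), tri, 1))  # 4: empty set + three single edges
print(count_b_matchings(range(3), tri, 2))  # 8: every edge subset is allowed
```

The point of the paper is to approximate this count in polynomial time for bounded-degree graphs, where brute force is exponential in $|E|$.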
We consider the problem of determining the manifold $n$-widths of Sobolev and Besov spaces with error measured in the $L_p$-norm. The manifold widths control how efficiently these spaces can be approximated by general non-linear parametric methods with the restriction that the parameter selection and parameterization maps must be continuous. Existing upper and lower bounds only match when the Sobolev or Besov smoothness index $q$ satisfies $q\leq p$ or $1 \leq p \leq 2$. We close this gap and obtain sharp lower bounds for all $1 \leq p,q \leq \infty$ for which a compact embedding holds. A key part of our analysis is to determine the exact value of the manifold widths of finite dimensional $\ell^M_q$-balls in the $\ell_p$-norm when $p\leq q$. Although this result is not new, we provide a new proof and apply it to lower bounding the manifold widths of Sobolev and Besov spaces. Our results show that the Bernstein widths, which are typically used to lower bound the manifold widths, decay asymptotically faster than the manifold widths in many cases.
Given a graph $G=(V,E)$, a function $f:V\to \{0,1,2\}$ is said to be a \emph{Roman Dominating function} if for every $v\in V$ with $f(v)=0$, there exists a vertex $u\in N(v)$ such that $f(u)=2$. A Roman Dominating function $f$ is said to be an \emph{Independent Roman Dominating function} (or IRDF), if $V_1\cup V_2$ forms an independent set, where $V_i=\{v\in V~\vert~f(v)=i\}$, for $i\in \{0,1,2\}$. The total weight of $f$ is equal to $\sum_{v\in V} f(v)$, and is denoted as $w(f)$. The \emph{Independent Roman Domination Number} of $G$, denoted by $i_R(G)$, is defined as $\min\{w(f)~\vert~f$ is an IRDF of $G\}$. For a given graph $G$, the problem of computing $i_R(G)$ is defined as the \emph{Minimum Independent Roman Domination problem}. The problem is already known to be NP-hard for bipartite graphs. In this paper, we further study the algorithmic complexity of the problem and propose polynomial-time algorithms to solve the Minimum Independent Roman Domination problem for distance-hereditary graphs, split graphs, and $P_4$-sparse graphs.
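The two defining conditions, Roman domination plus independence of the positively labeled vertices, can be checked directly. This small Python sketch (our own illustration, not the paper's algorithm) verifies an IRDF and computes $i_R(G)$ by brute force on a tiny graph.

```python
from itertools import product

def is_irdf(adj, f):
    """Check that f: V -> {0,1,2} is an Independent Roman Dominating function."""
    n = len(adj)
    # Roman domination: every vertex labeled 0 needs a neighbor labeled 2.
    for v in range(n):
        if f[v] == 0 and not any(f[u] == 2 for u in adj[v]):
            return False
    # Independence: no edge may join two positively labeled vertices.
    for v in range(n):
        for u in adj[v]:
            if f[v] > 0 and f[u] > 0:
                return False
    return True

def i_R(adj):
    """Brute-force Independent Roman Domination Number (exponential in |V|)."""
    n = len(adj)
    return min(sum(f) for f in product((0, 1, 2), repeat=n) if is_irdf(adj, f))

# Path P4 with vertices 0-1-2-3, given as adjacency lists.
adj = [[1], [0, 2], [1, 3], [2]]
print(i_R(adj))  # 3: e.g. f = (0, 2, 0, 1)
```

Brute force takes $3^{|V|}$ steps; the paper's contribution is polynomial-time algorithms on the structured graph classes named above.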
The {\em discrepancy} of a matrix $M \in \mathbb{R}^{d \times n}$ is given by $\mathrm{DISC}(M) := \min_{\boldsymbol{x} \in \{-1,1\}^n} \|M\boldsymbol{x}\|_\infty$. An outstanding conjecture, attributed to Koml\'os, stipulates that $\mathrm{DISC}(M) = O(1)$, whenever $M$ is a Koml\'os matrix, that is, whenever every column of $M$ lies in the Euclidean unit ball. Our main result asserts that $\mathrm{DISC}(M + R/\sqrt{d}) = O(d^{-1/2})$ holds asymptotically almost surely, whenever $M \in \mathbb{R}^{d \times n}$ is Koml\'os, $R \in \mathbb{R}^{d \times n}$ is a Rademacher random matrix, $d = \omega(1)$, and $n = \omega(d \log d)$. The factor $d^{-1/2}$ normalising $R$ is essentially best possible and the dependency between $n$ and $d$ is asymptotically best possible. Our main source of inspiration is a result by Bansal, Jiang, Meka, Singla, and Sinha (ICALP 2022). They obtained an assertion similar to the one above in the case that the smoothing matrix is Gaussian. They asked whether their result can be attained with the optimal dependency $n = \omega(d \log d)$ in the case of Bernoulli random noise or any other types of discretely distributed noise; the latter types being more conducive for Smoothed Analysis in other discrepancy settings such as the Beck-Fiala problem. For Bernoulli noise, their method works if $n = \omega(d^2)$. In the case of Rademacher noise, we answer the question posed by Bansal, Jiang, Meka, Singla, and Sinha. Our proof builds upon their approach in a strong way and provides a discrete version of the latter. Breaking the $n = \omega(d^2)$ barrier and reaching the optimal dependency $n = \omega(d \log d)$ for Rademacher noise requires additional ideas, expressed through a rather meticulous counting argument necessitated by the need to maintain a high level of precision throughout the discretisation process.
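The definition of $\mathrm{DISC}(M)$ can be evaluated exactly for tiny instances by enumerating all sign vectors. A minimal Python sketch (our illustration; the example matrices are ours), showing how repeated Koml\'os columns can cancel:

```python
import numpy as np
from itertools import product

def disc(M):
    """Brute-force DISC(M) = min over x in {-1,1}^n of the sup-norm of Mx."""
    d, n = M.shape
    return min(np.abs(M @ np.array(x)).max() for x in product((-1, 1), repeat=n))

# Columns of the identity are unit vectors, hence a Komlos matrix.
M = np.eye(2)
print(disc(M))  # 1.0: no signing can cancel a lone unit column

# Duplicating each column lets opposite signs cancel pairwise.
M2 = np.hstack([np.eye(2), np.eye(2)])
print(disc(M2))  # 0.0
```

Enumeration costs $2^n$; the conjecture and the smoothed-analysis results above concern what is achievable in general for large $n$ and $d$.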
Let $X$ be an $n$-element point set in the $k$-dimensional unit cube $[0,1]^k$ where $k \geq 2$. According to an old result of Bollob\'as and Meir (1992), there exists a cycle (tour) $x_1, x_2, \ldots, x_n$ through the $n$ points, such that $\left(\sum_{i=1}^n |x_i - x_{i+1}|^k \right)^{1/k} \leq c_k$, where $|x-y|$ is the Euclidean distance between $x$ and $y$, $x_{n+1} \equiv x_1$, and $c_k$ is a constant depending only on $k$. In the other direction, for every $k \geq 2$ and $n \geq 2$, there exist $n$ points in $[0,1]^k$, such that their shortest tour satisfies $\left(\sum_{i=1}^n |x_i - x_{i+1}|^k \right)^{1/k} = 2^{1/k} \cdot \sqrt{k}$. For the plane, the best constant is $c_2=2$ and this is the only exact value known. Bollob{\'a}s and Meir showed that one can take $c_k = 9 \left(\frac23 \right)^{1/k} \cdot \sqrt{k}$ for every $k \geq 3$ and conjectured that the best constant is $c_k = 2^{1/k} \cdot \sqrt{k}$, for every $k \geq 2$. Here we significantly improve the upper bound and show that one can take $c_k = 3 \sqrt5 \left(\frac23 \right)^{1/k} \cdot \sqrt{k}$ or $c_k = 2.91 \sqrt{k} \ (1+o_k(1))$. Our bounds are constructive. We also show that $c_3 \geq 2^{7/6}$, which disproves the conjecture for $k=3$. Connections to matching problems, power assignment problems, and other related problems, including algorithmic aspects, are also discussed. A slightly revised version of the Bollob\'as--Meir conjecture is proposed.
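The quantity being bounded, the $k$-th-power sum of tour edge lengths, is easy to compute directly. This Python sketch (our illustration; point set and names are ours) evaluates it for all tours of a tiny point set; the four corners of the unit square realize the planar extremal value $2^{1/2}\cdot\sqrt{2}=2=c_2$.

```python
import numpy as np
from itertools import permutations

def tour_cost(points, order, k):
    """Compute (sum over the cycle of |x_i - x_{i+1}|^k)^(1/k)."""
    pts = np.asarray(points, dtype=float)[list(order)]
    diffs = pts - np.roll(pts, -1, axis=0)      # wraps x_{n+1} back to x_1
    return (np.linalg.norm(diffs, axis=1) ** k).sum() ** (1.0 / k)

def best_tour_cost(points, k):
    """Brute-force minimum over all cyclic orders (point 0 fixed first)."""
    n = len(points)
    return min(tour_cost(points, (0,) + p, k)
               for p in permutations(range(1, n)))

# Corners of the unit square: the perimeter tour has cost (4 * 1^2)^(1/2) = 2.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(best_tour_cost(square, 2))  # 2.0
```

The theorems above assert that such a tour of cost at most $c_k$ exists for every point set, with the constructive upper bounds giving an algorithmic route to finding one.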
In multi-turn dialog, utterances do not always take the full form of sentences \cite{Carbonell1983DiscoursePA}, which naturally makes understanding the dialog context more difficult. However, it is essential to fully grasp the dialog context to generate a reasonable response. Hence, in this paper, we propose to improve response generation performance by examining the model's ability to answer a reading comprehension question, where the question is focused on the omitted information in the dialog. Inspired by the multi-task learning scheme, we propose a joint framework that unifies these two tasks, sharing the same encoder to extract the common and task-invariant features, with different decoders to learn task-specific features. To better fuse information from the question and the dialog history in the encoding part, we propose to augment the Transformer architecture with a memory updater, which is designed to selectively store and update the dialog history information so as to support downstream tasks. For the experiments, we employ human annotators to write and examine a large-scale dialog reading comprehension dataset. Extensive experiments are conducted on this dataset, and the results show that the proposed model brings substantial improvements over several strong baselines on both tasks. In this way, we demonstrate that reasoning can indeed help better response generation and vice versa. We release our large-scale dataset for further research.