We investigate the consequences of two Lip$(\gamma)$ functions, in the sense of Stein, being close throughout a subset of their domain. A particular consequence of our results is the following. Given $K_0 > \varepsilon > 0$ and $\gamma > \eta > 0$, there is a constant $\delta = \delta(\gamma,\eta,\varepsilon,K_0) > 0$ for which the following is true. Let $\Sigma \subset \mathbb{R}^d$ be closed and $f, h : \Sigma \to \mathbb{R}$ be Lip$(\gamma)$ functions whose Lip$(\gamma)$ norms are both bounded above by $K_0$. Suppose $B \subset \Sigma$ is closed and that $f$ and $h$ coincide throughout $B$. Then, on the set of points in $\Sigma$ whose distance to $B$ is at most $\delta$, the Lip$(\eta)$ norm of the difference $f-h$ is bounded above by $\varepsilon$. More generally, we establish that this phenomenon remains valid in a less restrictive Banach space setting under the weaker hypothesis that the two Lip$(\gamma)$ functions $f$ and $h$ are only close in a pointwise sense throughout the closed subset $B$. We require only that the subset $\Sigma$ be closed; in particular, our results cover the case that $\Sigma$ is finite. The restriction that $\eta < \gamma$ is sharp in the sense that our result is false for $\eta := \gamma$.
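Schematically, and writing $\Sigma_\delta$ purely as shorthand (introduced here, not in the abstract) for the $\delta$-neighbourhood of $B$ within $\Sigma$, the quantitative statement above reads:
\[
\|f\|_{\mathrm{Lip}(\gamma)} \le K_0,\quad \|h\|_{\mathrm{Lip}(\gamma)} \le K_0,\quad f|_B = h|_B
\;\Longrightarrow\;
\big\|(f-h)|_{\Sigma_\delta}\big\|_{\mathrm{Lip}(\eta)} \le \varepsilon,
\qquad
\Sigma_\delta := \{\, x \in \Sigma : \operatorname{dist}(x,B) \le \delta \,\}.
\]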
Modern compression methods can summarize a target distribution $\mathbb{P}$ more succinctly than i.i.d. sampling but require access to a low-bias input sequence, such as a Markov chain converging quickly to $\mathbb{P}$. We introduce a new suite of compression methods suitable for biased input sequences. Given $n$ points targeting the wrong distribution and quadratic time, Stein kernel thinning (SKT) returns $\sqrt{n}$ equal-weighted points with $\widetilde{O}(n^{-1/2})$ maximum mean discrepancy (MMD) to $\mathbb{P}$. For larger-scale compression tasks, low-rank SKT achieves the same feat in sub-quadratic time using an adaptive low-rank debiasing procedure that may be of independent interest. For downstream tasks that support simplex or constant-preserving weights, Stein recombination and Stein Cholesky achieve even greater parsimony, matching the guarantees of SKT with as few as $\text{poly-log}(n)$ weighted points. Underlying these advances are new guarantees for the quality of simplex-weighted coresets, the spectral decay of kernel matrices, and the covering numbers of Stein kernel Hilbert spaces. In our experiments, our techniques provide succinct and accurate posterior summaries while overcoming biases due to burn-in, approximate Markov chain Monte Carlo, and tempering.
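As a minimal sketch of the two ingredients described above, the following toy code builds a Langevin Stein kernel from an inverse multiquadric base kernel (assuming the target's score is available; here a standard Gaussian with score $s(x) = -x$) and then selects an equal-weighted coreset by greedy MMD minimization. The greedy step is a simplification standing in for kernel thinning, and all function names and parameters are illustrative rather than the authors' implementation.

```python
import numpy as np

def imq_stein_kernel(X, score, c=1.0, beta=0.5):
    """Langevin Stein kernel built from an IMQ base kernel
    k(x, y) = (c^2 + ||x - y||^2)^(-beta).  `score` maps points to
    grad log p; for a standard Gaussian target, score(x) = -x."""
    n, d = X.shape
    S = score(X)                               # (n, d) score at each point
    R = X[:, None, :] - X[None, :, :]          # pairwise differences x - y
    sq = np.sum(R**2, axis=-1)                 # ||x - y||^2
    u = c**2 + sq
    k = u**(-beta)                             # base kernel
    gx = -2 * beta * u[..., None]**(-beta - 1) * R   # grad_x k
    gy = -gx                                          # grad_y k
    div = 2 * beta * (-2 * (beta + 1) * u**(-beta - 2) * sq
                      + d * u**(-beta - 1))           # grad_x . grad_y k
    return (div
            + np.einsum('id,ijd->ij', S, gy)
            + np.einsum('jd,ijd->ij', S, gx)
            + k * (S @ S.T))

def greedy_thin(K, m):
    """Pick m points greedily to minimise the squared MMD to the target,
    which for a Stein kernel equals mean(K[sel][:, sel]).  This is a
    simplification standing in for the kernel-thinning step."""
    sel = []
    running = np.zeros(K.shape[0])             # sum_{i in sel} K[i, j]
    for _ in range(m):
        gain = K.diagonal() + 2 * running      # increase in the selected sum
        gain[sel] = np.inf                     # never reselect a point
        j = int(np.argmin(gain))
        sel.append(j)
        running += K[j]
    return np.array(sel)

# Toy usage: biased sample (wrong mean) targeting a standard Gaussian.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, size=(400, 2))         # biased input points
K = imq_stein_kernel(X, score=lambda Z: -Z)
idx = greedy_thin(K, m=20)
print("MMD^2 of coreset:", K[np.ix_(idx, idx)].mean())
print("MMD^2 of full biased sample:", K.mean())
```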
Consider the expected query complexity of computing the $k$-fold direct product $f^{\otimes k}$ of a function $f$ to error $\varepsilon$ with respect to a distribution $\mu^k$. One strategy is to sequentially compute each of the $k$ copies to error $\varepsilon/k$ with respect to $\mu$ and apply the union bound. We prove a strong direct sum theorem showing that this naive strategy is essentially optimal. In particular, computing a direct product necessitates a blowup in both query complexity and error. Strong direct sum theorems contrast with results that only show a blowup in query complexity or error but not both. There has been a long line of such results for distributional query complexity, dating back to (Impagliazzo, Raz, Wigderson 1994) and (Nisan, Rudich, Saks 1994), but a strong direct sum theorem had been elusive. A key idea in our work is the first use of the Hardcore Theorem (Impagliazzo 1995) in the context of query complexity. We prove a new "resilience lemma" that accompanies it, showing that the hardcore of $f^{\otimes k}$ is likely to remain dense under arbitrary partitions of the input space.
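For concreteness, the union-bound arithmetic behind the naive strategy is the following, writing $\overline{D}^{\mu}_{\varepsilon}(f)$ for the expected query complexity of computing $f$ to error $\varepsilon$ under $\mu$ (notation introduced here only for illustration):
\[
\Pr\big[\text{some copy errs}\big] \;\le\; k \cdot \frac{\varepsilon}{k} \;=\; \varepsilon,
\qquad
\overline{D}^{\mu^k}_{\varepsilon}\big(f^{\otimes k}\big) \;\le\; k \cdot \overline{D}^{\mu}_{\varepsilon/k}(f),
\]
and the strong direct sum theorem shows that this upper bound is essentially the best possible.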
Recent theoretical developments in coset coding theory have provided continuous-valued functions which give the equivocation and maximum likelihood (ML) decoding probability of coset secrecy codes. In this work, we develop a method for incorporating these functions, along with a complex set of constraints, into a gradient descent optimization algorithm. This algorithm employs a movement cost function and a trigonometric update step to ensure that the continuous-valued code definition vector ultimately reaches a value which yields a realizable coset code. We use this algorithm to produce coset codes with blocklengths up to a few thousand and compare them against published codes, including both short-blocklength and capacity-achieving constructions. For most code sizes, the codes generated using gradient descent outperform all others, especially capacity-achieving constructions, which perform significantly worse than randomly generated codes at short blocklengths.
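The sketch below is only meant to illustrate the flavour of such an optimization loop, under stated assumptions: the objective gradient is a placeholder (not an equivocation or ML-decoding metric), the $x = \sin^2\theta$ parameterisation stands in for the trigonometric update step, and the annealed penalty pushing entries toward $\{0,1\}$ stands in for the movement cost function; none of these details are taken from the paper.

```python
import numpy as np

def optimize_code_vector(objective_grad, n, steps=2000, lr=0.05, rho=1e-3):
    """Illustrative constrained gradient descent: the continuous code-
    definition vector is parameterised as x = sin(theta)^2, which keeps
    every entry in [0, 1] without projection, while a penalty whose weight
    grows over time pushes entries toward the realizable endpoints {0, 1}.
    `objective_grad`, `rho`, and the parameterisation are placeholders."""
    theta = np.pi / 4 * np.ones(n)              # start at x = 0.5 everywhere
    for t in range(steps):
        x = np.sin(theta) ** 2
        dx_dtheta = np.sin(2 * theta)           # chain rule through sin^2
        # gradient of the annealed penalty (rho * t) * x * (1 - x)
        penalty_grad = rho * t * (1 - 2 * x)
        grad = (objective_grad(x) + penalty_grad) * dx_dtheta
        theta -= lr * grad
    return (np.sin(theta) ** 2 > 0.5).astype(int)   # round to a binary vector

# Toy usage with a stand-in quadratic objective (not a secrecy-code metric).
target = np.array([1, 0, 1, 1, 0, 0, 1, 0])
vec = optimize_code_vector(lambda x: 2 * (x - target), n=8)
print(vec)
```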
How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured \emph{on average} over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically-founded solution to this problem: to train \emph{Self-Proving models} that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof. Self-Proving models satisfy that, with high probability over a random input, the model generates a correct output \emph{and} successfully proves its correctness to $V\!$. The \emph{soundness} property of $V$ guarantees that, for \emph{every} input, no model can convince $V$ of the correctness of an incorrect output. Thus, a Self-Proving model proves correctness of most of its outputs, while \emph{all} incorrect outputs (of any model) are detected by $V$. We devise a generic method for learning Self-Proving models, and we prove convergence bounds under certain assumptions. The theoretical framework and results are complemented by experiments on an arithmetic capability: computing the greatest common divisor (GCD) of two integers. Our learning method is used to train a Self-Proving transformer that computes the GCD \emph{and} proves the correctness of its answer.
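As one concrete (and purely illustrative) way a verifier $V$ could check the GCD capability mentioned above, consider a Bézout certificate: the prover supplies $d$ together with coefficients $(u, v)$, and $V$ accepts iff $d$ divides both inputs and $au + bv = d$. This is a sketch of a plausible verifier, not necessarily the protocol used in the paper.

```python
def verify_gcd(a: int, b: int, d: int, u: int, v: int) -> bool:
    """Sound check that d = gcd(a, b): divisibility shows d is a common
    divisor, and a*u + b*v == d shows every common divisor divides d."""
    if d <= 0:
        return False
    return a % d == 0 and b % d == 0 and a * u + b * v == d

def honest_prover(a: int, b: int):
    """Extended Euclid: returns (d, u, v) with a*u + b*v = d = gcd(a, b)."""
    old_r, r = a, b
    old_u, u = 1, 0
    old_v, v = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_u, u = u, old_u - q * u
        old_v, v = v, old_v - q * v
    return old_r, old_u, old_v

a, b = 252, 198
d, u, v = honest_prover(a, b)
assert verify_gcd(a, b, d, u, v)          # the correct answer is accepted
assert not verify_gcd(a, b, 9, 1, -1)     # an incorrect gcd cannot be certified
```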
We propose a novel random walk-based algorithm for unbiased estimation of arbitrary functions of a weighted adjacency matrix, coined universal graph random features (u-GRFs). This includes many of the most popular examples of kernels defined on the nodes of a graph. Our algorithm enjoys subquadratic time complexity with respect to the number of nodes, overcoming the notoriously prohibitive cubic scaling of exact graph kernel evaluation. It can also be trivially distributed across machines, permitting learning on much larger networks. At the heart of the algorithm is a modulation function which upweights or downweights the contribution from different random walks depending on their lengths. We show that by parameterising it with a neural network we can obtain u-GRFs that give higher-quality kernel estimates or perform efficient, scalable kernel learning. We provide robust theoretical analysis and support our findings with experiments including pointwise estimation of fixed graph kernels, solving non-homogeneous graph ordinary differential equations, node clustering and kernel regression on triangular meshes.
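A stripped-down illustration of the mechanism described above: random walks whose contributions are reweighted by a modulation function of their length yield an unbiased Monte Carlo estimate of a power series in the weighted adjacency matrix. The fixed halting probability, the hand-picked modulation, and the dense-matrix bookkeeping are simplifications for readability, not the u-GRF construction itself.

```python
import numpy as np

def walk_estimate(A, modulation, start, n_walks=20000, p_halt=0.5, rng=None):
    """Unbiased estimate of row `start` of sum_k modulation(k) * A^k for a
    nonnegatively weighted adjacency matrix A.  Each walk picks a uniformly
    random neighbour, halts with probability p_halt at every step, and
    carries an importance weight ("load") correcting for both choices."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = A.shape[0]
    neighbours = [np.flatnonzero(A[i]) for i in range(n)]
    est = np.zeros(n)
    for _ in range(n_walks):
        node, load, k = start, 1.0, 0
        est[node] += modulation(k) * load           # length-0 contribution
        while rng.random() > p_halt and neighbours[node].size > 0:
            nxt = rng.choice(neighbours[node])
            # correct for uniform neighbour choice and survival probability
            load *= A[node, nxt] * neighbours[node].size / (1.0 - p_halt)
            node, k = nxt, k + 1
            est[node] += modulation(k) * load
    return est / n_walks

# Toy usage: estimate a geometric kernel (I - 0.3 A)^{-1} on a small graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
row = walk_estimate(A, modulation=lambda k: 0.3**k, start=0)
exact = np.linalg.inv(np.eye(4) - 0.3 * A)[0]
print(np.round(row, 3), np.round(exact, 3))
```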
The NP-hard scheduling problem P||C_max asks for a set of tasks with known execution times to be mapped to a set of identical machines such that the overall completion time (makespan) is minimized. In this work, we improve existing techniques for optimal P||C_max scheduling with a combination of new theoretical insights and careful practical engineering. Most importantly, we derive techniques to prune vast portions of the search space of branch-and-bound (BnB) approaches. We also propose improved upper and lower bounding techniques which can be combined with any approach to P||C_max. Moreover, we present new benchmarks for P||C_max, based on diverse application data, which can shed light on aspects that prior synthetic instances fail to capture. In an extensive evaluation, we observe that our pruning techniques reduce the number of explored nodes by 90$\times$ and running times by 12$\times$. Compared to a state-of-the-art ILP-based approach, our approach is preferable for short running time limits and for instances with large makespans.
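To make the search space and the two standard bounds concrete, here is a bare-bones branch-and-bound skeleton for P||C_max (LPT for the initial upper bound, $\max(\text{longest job}, \lceil \text{sum}/m \rceil)$ for the lower bound, plus elementary symmetry breaking); none of the paper's pruning rules are reproduced here.

```python
import math

def lpt_upper_bound(jobs, m):
    """Longest-processing-time-first schedule gives an initial upper bound."""
    loads = [0] * m
    for p in sorted(jobs, reverse=True):
        loads[loads.index(min(loads))] += p
    return max(loads)

def branch_and_bound(jobs, m):
    """Optimal makespan for P||C_max: assign jobs (longest first) to machines,
    pruning branches whose machine load already reaches the incumbent."""
    jobs = sorted(jobs, reverse=True)
    lower = max(max(jobs), math.ceil(sum(jobs) / m))
    best = lpt_upper_bound(jobs, m)

    def recurse(i, loads):
        nonlocal best
        if best == lower:                  # optimality already certified
            return
        if i == len(jobs):
            best = min(best, max(loads))
            return
        seen = set()                       # symmetry breaking: skip equal loads
        for k in range(m):
            if loads[k] in seen:
                continue
            seen.add(loads[k])
            if loads[k] + jobs[i] < best:  # bound: prune non-improving branches
                loads[k] += jobs[i]
                recurse(i + 1, loads)
                loads[k] -= jobs[i]

    recurse(0, [0] * m)
    return best

print(branch_and_bound([3, 3, 2, 2, 2], m=2))  # optimal makespan 6 (LPT gives 7)
```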
While operations \emph{rank} and \emph{select} on static bitvectors can be supported in constant time, lower bounds show that supporting updates raises the cost per operation to $\Theta(\log n/ \log\log n)$. This is a shame in scenarios where updates are possible but uncommon. We develop a representation of bitvectors that, if there are $q = \Omega(\log^2 n)$ queries per update, supports all the operations in $O(\log(n/q))$ amortized time. Our experimental results support the theoretical findings, displaying speedups of orders of magnitude compared to standard dynamic implementations.
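For context on the static operations mentioned above, a small sketch of block-based rank and a rank-driven select follows; real implementations use two-level directories and word-level tricks, and the amortized dynamic scheme from the abstract is not reproduced here.

```python
class StaticBitvector:
    """Static bitvector with rank accelerated by per-block prefix counts.
    The block size and the linear scan inside a block are simplifications."""
    BLOCK = 64

    def __init__(self, bits):
        self.bits = bits
        # prefix[i] = number of 1s strictly before block i
        self.prefix = [0]
        for start in range(0, len(bits), self.BLOCK):
            self.prefix.append(self.prefix[-1] + sum(bits[start:start + self.BLOCK]))

    def rank1(self, i):
        """Number of 1s in bits[0:i]."""
        b = i // self.BLOCK
        return self.prefix[b] + sum(self.bits[b * self.BLOCK:i])

    def select1(self, j):
        """Position of the j-th 1 (1-indexed), via binary search on rank."""
        lo, hi = 0, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) < j:
                lo = mid + 1
            else:
                hi = mid
        return lo

bv = StaticBitvector([1, 0, 1, 1, 0, 0, 1])
assert bv.rank1(4) == 3       # 1s among the first four bits
assert bv.select1(3) == 3     # the third 1 sits at position 3
```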
A nearest neighbor representation of a Boolean function $f$ is a set of vectors (anchors) labeled by $0$ or $1$ such that $f(\vec{x}) = 1$ if and only if the closest anchor to $\vec{x}$ is labeled by $1$. This model was introduced by Hajnal, Liu, and Tur\'an (2022), who studied bounds on the number of anchors required to represent Boolean functions under different choices of anchors (real vs. Boolean vectors) as well as the more expressive model of $k$-nearest neighbors. We initiate the study of the representational power of nearest and $k$-nearest neighbors through Boolean circuit complexity. To this end, we establish a connection between Boolean functions with polynomial nearest neighbor complexity and those that can be efficiently represented by classes based on linear inequalities -- min-plus polynomial threshold functions -- previously studied in relation to threshold circuits. This extends an observation of Hajnal et al. (2022). We obtain exponential lower bounds on the $k$-nearest neighbors complexity of explicit $n$-variate functions, assuming $k \leq n^{1-\epsilon}$. Previously, no superlinear lower bound was known for any $k>1$. Next, we further extend the connection between nearest neighbor representations and circuits to the $k$-nearest neighbors case. As a result, we show that proving superpolynomial lower bounds for the $k$-nearest neighbors complexity of an explicit function for arbitrary $k$ would require a breakthrough in circuit complexity. In addition, we prove an exponential separation between the nearest neighbor and $k$-nearest neighbors complexity (for unrestricted $k$) of an explicit function. These results address questions raised by Hajnal et al. (2022) of proving strong lower bounds for $k$-nearest neighbors and understanding the role of the parameter $k$. Finally, we devise new bounds on the nearest neighbor complexity for several explicit functions.
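To make the definition above concrete, the following check (illustrative only) verifies that two Boolean anchors suffice for the majority function on an odd number of variables: the all-zeros anchor labeled $0$ and the all-ones anchor labeled $1$.

```python
from itertools import product

def nn_label(x, anchors):
    """Label of the closest anchor under Hamming distance (no ties arise
    here because the number of variables is odd)."""
    return min(anchors, key=lambda a: sum(ai != xi for ai, xi in zip(a[0], x)))[1]

n = 5
anchors = [((0,) * n, 0), ((1,) * n, 1)]          # (anchor vector, label)
for x in product((0, 1), repeat=n):
    majority = int(sum(x) > n // 2)
    assert nn_label(x, anchors) == majority       # 2 anchors represent MAJ_5
print("MAJ_5 has nearest neighbor complexity at most 2")
```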
Tight wavelet frames (TWFs) in $L^2(\mathbb{R}^n)$ are versatile and practical structures that provide the perfect reconstruction property. Nevertheless, existing TWF construction methods exhibit limitations: extension-based constructions lack specific methods for generating mother wavelets, while SOS-based constructions provide such methods but still require solving the sum of squares (SOS) problem. It is also common practice for current TWF constructions to begin with a given refinable function; this approach, however, places the entire burden on finding suitable mother wavelets. In this paper, we introduce TWF construction methods that spread the burden between both types of functions: refinable functions and mother wavelets. These construction methods offer an alternative approach that circumvents the SOS problem while providing specific techniques for generating mother wavelets. We present examples to illustrate our construction methods.
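For reference, the perfect reconstruction property mentioned above is the tight frame identity; with dyadic dilations as the standard choice (a construction may instead use a general dilation matrix), a system generated by mother wavelets $\psi^{(1)},\dots,\psi^{(L)}$ is a tight wavelet frame for $L^2(\mathbb{R}^n)$ precisely when
\[
\sum_{\ell=1}^{L}\sum_{j\in\mathbb{Z}}\sum_{k\in\mathbb{Z}^n}
\big|\langle f,\psi^{(\ell)}_{j,k}\rangle\big|^{2} \;=\; \|f\|_{L^2}^{2}
\quad\text{for all } f\in L^2(\mathbb{R}^n),
\qquad
\psi^{(\ell)}_{j,k}(x) := 2^{jn/2}\,\psi^{(\ell)}(2^{j}x-k),
\]
equivalently $f=\sum_{\ell,j,k}\langle f,\psi^{(\ell)}_{j,k}\rangle\,\psi^{(\ell)}_{j,k}$ in $L^2(\mathbb{R}^n)$.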
We investigate a lattice-structured LSTM model for Chinese named entity recognition (NER), which encodes a sequence of input characters as well as all potential words that match a lexicon. Compared with character-based methods, our model explicitly leverages word and word sequence information. Compared with word-based methods, lattice LSTM does not suffer from segmentation errors. Gated recurrent cells allow our model to choose the most relevant characters and words from a sentence for better NER results. Experiments on various datasets show that lattice LSTM outperforms both word-based and character-based LSTM baselines, achieving the best results.
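The following numpy schematic (random, untrained weights) illustrates only the merging step suggested above: cell states proposed by lexicon words ending at the current character are combined with the character-level candidate cell through normalised gates. Dimensions, initialisation, and most of the real model are omitted, and the gating follows the spirit rather than the letter of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                    # hidden / embedding size (toy)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x, h, W):
    return sigmoid(W @ np.concatenate([x, h]))

# Toy parameters (random, untrained) for the merging step only.
W_i, W_f, W_o, W_c = (rng.normal(scale=0.1, size=(D, 2 * D)) for _ in range(4))
W_word_gate = rng.normal(scale=0.1, size=(D, 2 * D))

def lattice_step(x_char, h_prev, c_prev, word_cells):
    """One character step.  `word_cells` holds cell states proposed by
    lexicon words ending at this character; their contributions and the
    character-level candidate are mixed with softmax-normalised gates."""
    i = gate(x_char, h_prev, W_i)
    o = gate(x_char, h_prev, W_o)
    c_tilde = np.tanh(W_c @ np.concatenate([x_char, h_prev]))
    if word_cells:
        gates = [gate(x_char, c_w, W_word_gate) for c_w in word_cells]
        all_gates = np.stack([i] + gates)                     # (1 + #words, D)
        alpha = np.exp(all_gates) / np.exp(all_gates).sum(0)  # normalise per dim
        c = alpha[0] * c_tilde + sum(a * c_w for a, c_w in zip(alpha[1:], word_cells))
    else:
        f = gate(x_char, h_prev, W_f)
        c = f * c_prev + i * c_tilde                          # plain LSTM update
    return o * np.tanh(c), c

# Toy usage: three characters, one lexicon word ending at the third.
h = c = np.zeros(D)
chars = rng.normal(size=(3, D))
for t, x in enumerate(chars):
    # stand-in for a word cell computed from a lexicon match (not modelled here)
    word_cells = [np.tanh(rng.normal(size=D))] if t == 2 else []
    h, c = lattice_step(x, h, c, word_cells)
print(h.round(3))
```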