The Delaunay filtration $\mathcal{D}_{\bullet}(X)$ of a point cloud $X\subset \mathbb{R}^d$ is a central tool of computational topology. Its use is justified by the topological equivalence of $\mathcal{D}_{\bullet}(X)$ and the offset (i.e., union-of-balls) filtration of $X$. Given a function $\gamma: X \to \mathbb{R}$, we introduce a Delaunay bifiltration $\mathcal{DC}_{\bullet}(\gamma)$ that satisfies an analogous topological equivalence, ensuring that $\mathcal{DC}_{\bullet}(\gamma)$ topologically encodes the offset filtrations of all sublevel sets of $\gamma$, as well as the topological relations between them. $\mathcal{DC}_{\bullet}(\gamma)$ is of size $O(|X|^{\lceil\frac{d+1}{2}\rceil})$, which for $d$ odd matches the worst-case size of $\mathcal{D}_{\bullet}(X)$. Adapting the Bowyer-Watson algorithm for computing Delaunay triangulations, we give a simple, practical algorithm to compute $\mathcal{DC}_{\bullet}(\gamma)$ in time $O(|X|^{\lceil \frac{d}{2}\rceil +1})$. Our implementation, based on CGAL, computes $\mathcal{DC}_{\bullet}(\gamma)$ with modest overhead compared to computing $\mathcal{D}_{\bullet}(X)$, and handles tens of thousands of points in $\mathbb{R}^3$ within seconds.
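To make the bigraded bookkeeping concrete, the following Python sketch (using SciPy's Delaunay triangulation) assigns every Delaunay simplex a naive bigrade: half the simplex diameter as a crude stand-in for the alpha/radius scale, and the maximum of $\gamma$ over its vertices for the sublevel direction. This is only an illustration of the kind of object $\mathcal{DC}_{\bullet}(\gamma)$ grades; it is not the paper's construction or its Bowyer-Watson-based algorithm, and the helper name and radius proxy are ours.
\begin{verbatim}
# Illustrative only: NOT the paper's DC_.(gamma) construction or algorithm.
from itertools import combinations
import numpy as np
from scipy.spatial import Delaunay

def naive_delaunay_bigrades(X, gamma):
    """X: (n, d) point cloud; gamma: (n,) function values on the points."""
    tri = Delaunay(X)
    grades = {}
    for cell in tri.simplices:                  # top-dimensional Delaunay cells
        for k in range(1, len(cell) + 1):       # enumerate all faces of the cell
            for face in combinations(sorted(int(v) for v in cell), k):
                pts = X[list(face)]
                diam = max(np.linalg.norm(p - q) for p in pts for q in pts)
                scale = diam / 2.0              # crude proxy for the alpha/radius grade
                level = gamma[list(face)].max() # face enters once all its vertices do
                grades[face] = (scale, level)   # shared faces receive the same grade
    return grades

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
gamma = X[:, 0]                                 # any real-valued function on the points
print(len(naive_delaunay_bigrades(X, gamma)), "simplices graded")
\end{verbatim}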
We consider the problem of learning a graph that models the statistical relations among the $d$ variables of a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$. Standard approaches amount to searching for a precision matrix $\Theta$ representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators require storing the $d^{2}$ values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this work, we adopt a compressive viewpoint and aim to estimate a sparse $\Theta$ from a \emph{sketch} of the data, i.e. a low-dimensional vector of size $m \ll d^{2}$ carefully designed from $X$ using non-linear random features. Under certain assumptions on the spectrum of $\Theta$ (or its condition number), we show that it is possible to estimate $\Theta$ from a sketch of size $m=\Omega\left((d+2k)\log(d)\right)$, where $k$ is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory and involve restricted isometry properties and instance-optimal decoders. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach with the graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed.
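As a toy illustration of the compressive viewpoint (with arbitrary dimensions and a squared random feature standing in for the paper's non-linear features), the Python sketch below shows how averaged random features of the samples yield a vector of size $m \ll d^{2}$ whose coordinates are linear functionals of the empirical covariance; the paper's actual feature map, sketch size, and graphical-lasso-based decoder differ.
\begin{verbatim}
# Illustrative measurement model only; hypothetical feature choice throughout.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 20, 300                    # m << d**2 is the regime of interest
X = rng.normal(size=(n, d))                # data; rows are samples

A = rng.normal(size=(m, d)) / np.sqrt(d)   # random directions a_1, ..., a_m
z = ((X @ A.T) ** 2).mean(axis=0)          # sketch: z_j = (1/n) sum_i <a_j, x_i>^2

# Sanity check: z_j equals a_j^T Sigma_hat a_j, a linear functional of the
# empirical covariance, without forming the d x d matrix in a streaming setting.
Sigma_hat = X.T @ X / n
print(np.allclose(z, np.einsum('md,de,me->m', A, Sigma_hat, A), atol=1e-8))
\end{verbatim}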
Bidirectional attention, composed of self-attention with positional encodings and the masked language model (MLM) objective, has emerged as a key component of modern large language models (LLMs). Despite its empirical success, few studies have examined its statistical underpinnings: What statistical model is bidirectional attention implicitly fitting? What sets it apart from its non-attention predecessors? We explore these questions in this paper. The key observation is that fitting a single-layer single-head bidirectional attention, upon reparameterization, is equivalent to fitting a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights. Further, bidirectional attention with multiple heads and multiple layers is equivalent to stacked MoEs and a mixture of MoEs, respectively. This statistical viewpoint reveals the distinct use of MoE in bidirectional attention, which aligns with its practical effectiveness in handling heterogeneous data. It also suggests an immediate extension to categorical tabular data, if we view each word location in a sentence as a tabular feature. Across empirical studies, we find that this extension outperforms existing tabular extensions of transformers in out-of-distribution (OOD) generalization. Finally, this statistical perspective of bidirectional attention enables us to theoretically characterize when linear word analogies are present in its word embeddings. These analyses show that bidirectional attention can require much stronger assumptions to exhibit linear word analogies than its non-attention predecessors.
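A minimal NumPy sketch of the key observation (toy dimensions, random weights, no training): the output of a single-layer, single-head bidirectional attention at each position is a softmax-weighted, i.e. mixture, combination of per-position value vectors, which is the structure behind the CBOW-with-MoE reading. The dimensions and initialization below are arbitrary choices, not the paper's reparameterization.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 6, 16, 8                 # sequence length and widths (arbitrary)
E = rng.normal(size=(T, d_model))             # token embeddings + positional encodings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

Q, K, V = E @ Wq, E @ Wk, E @ Wv
scores = Q @ K.T / np.sqrt(d_head)            # bidirectional: no causal mask
mix = np.exp(scores - scores.max(axis=-1, keepdims=True))
mix /= mix.sum(axis=-1, keepdims=True)        # rows are mixture ("expert") weights
out = mix @ V                                 # each output: convex combination of values

print(np.allclose(mix.sum(axis=-1), 1.0))     # mixture weights sum to one per position
\end{verbatim}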
Modeling of multivariate random fields through Gaussian processes calls for the construction of valid cross-covariance functions describing the dependence between any two component processes at different spatial locations. The required validity conditions often present challenges that lead to complicated restrictions on the parameter space. The purpose of this paper is to present simplified techniques for establishing multivariate validity for the recently introduced Confluent Hypergeometric (CH) class of covariance functions. Specifically, we use multivariate mixtures to present both simplified and comprehensive conditions for validity, based on results on conditionally negative semidefinite matrices and the Schur product theorem. In addition, we establish the spectral density of the CH covariance and use this to construct valid multivariate models as well as propose new cross-covariances. We show that our proposed approach leads to valid multivariate cross-covariance models that inherit the desired marginal properties of the CH model and outperform the multivariate Mat\'ern model in out-of-sample prediction under slowly-decaying correlation of the underlying multivariate random field. We also establish properties of multivariate CH models, including equivalence of Gaussian measures, and demonstrate their use in modeling a multivariate oceanography data set consisting of temperature, salinity and oxygen, as measured by autonomous floats in the Southern Ocean.
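One ingredient cited above, the Schur product theorem, can be checked numerically; the snippet below is generic linear algebra, not the CH-specific construction, and simply verifies that the elementwise (Schur) product of two positive semidefinite matrices is again positive semidefinite.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def random_psd(p):
    B = rng.normal(size=(p, p))
    return B @ B.T                          # B B^T is positive semidefinite by construction

A, C = random_psd(5), random_psd(5)
schur = A * C                               # elementwise (Schur) product
eigs = np.linalg.eigvalsh(schur)
print(eigs.min() >= -1e-10)                 # all eigenvalues nonnegative up to round-off
\end{verbatim}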
We study the problem of hypothesis selection under the constraint of local differential privacy. Given a class $\mathcal{F}$ of $k$ distributions and a set of i.i.d. samples from an unknown distribution $h$, the goal of hypothesis selection is to pick a distribution $\hat{f}$ whose total variation distance to $h$ is comparable to that of the best distribution in $\mathcal{F}$ (with high probability). We devise an $\varepsilon$-locally-differentially-private ($\varepsilon$-LDP) algorithm that uses $\Theta\left(\frac{k}{\alpha^2\min \{\varepsilon^2,1\}}\right)$ samples to guarantee that $d_{TV}(h,\hat{f})\leq \alpha + 9 \min_{f\in \mathcal{F}}d_{TV}(h,f)$ with high probability. This sample complexity is optimal for $\varepsilon<1$, matching the lower bound of Gopi et al. (2020). All previously known algorithms for this problem required $\Omega\left(\frac{k\log k}{\alpha^2\min \{ \varepsilon^2 ,1\}} \right)$ samples to work. Moreover, our result demonstrates the power of interaction for $\varepsilon$-LDP hypothesis selection. Namely, it breaks the known lower bound of $\Omega\left(\frac{k\log k}{\alpha^2\min \{ \varepsilon^2 ,1\}} \right)$ for the sample complexity of non-interactive hypothesis selection. Our algorithm breaks this barrier using only $\Theta(\log \log k)$ rounds of interaction. To prove our results, we define the notion of \emph{critical queries} for a Statistical Query Algorithm (SQA), which may be of independent interest. Informally, an SQA is said to use a small number of critical queries if its success relies on the accuracy of only a small number of queries it asks. We then design an LDP algorithm that uses a smaller number of critical queries.
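For context, the basic $\varepsilon$-LDP primitive underlying statistical-query-style algorithms is randomized response; the sketch below (with an arbitrary query and parameters, and not the paper's hypothesis-selection procedure) shows how an aggregator debiases randomized-response bits to estimate the mean of a query.
\begin{verbatim}
# Randomized response for one statistical query under epsilon-LDP (toy example).
import numpy as np

def ldp_query_mean(bits, eps, rng):
    """bits: array of {0,1} true answers, one per user; returns an unbiased estimate."""
    p = np.exp(eps) / (np.exp(eps) + 1.0)        # keep the true bit w.p. p, flip otherwise
    keep = rng.random(len(bits)) < p
    reported = np.where(keep, bits, 1 - bits)
    # E[reported] = p*mean + (1-p)*(1-mean); invert this affine map to debias.
    return (reported.mean() - (1.0 - p)) / (2.0 * p - 1.0)

rng = np.random.default_rng(0)
true_bits = (rng.random(100_000) < 0.3).astype(int)
print(ldp_query_mean(true_bits, eps=0.5, rng=rng))   # concentrates near 0.3
\end{verbatim}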
The sub-packetization $\ell$ and the field size $q$ are of paramount importance in MSR array code constructions. For optimal-access MSR codes, Balaji et al. proved that $\ell\geq s^{\left\lceil n/s \right\rceil}$, where $s = d-k+1$. Rawat et al. showed that this lower bound is attainable for all admissible values of $d$ when the field size is exponential in $n$. After that, tremendous efforts have been devoted to reducing the field size. However, until now, a reduction to linear field size has only been available for $d\in\{k+1,k+2,k+3\}$ and $d=n-1$. In this paper, we construct the first class of explicit optimal-access MSR codes with the smallest sub-packetization $\ell = s^{\left\lceil n/s \right\rceil}$ for all $d$ between $k+1$ and $n-1$, resolving an open problem in the survey (Ramkumar et al., Foundations and Trends in Communications and Information Theory: Vol. 19: No. 4). We further propose another class of explicit MSR code constructions (not optimal-access) with an even smaller sub-packetization $s^{\left\lceil n/(s+1)\right\rceil }$ for all admissible values of $d$, making significant progress on another open problem in the survey. Previously, MSR codes with $\ell=s^{\left\lceil n/(s+1)\right\rceil }$ and $q=O(n)$ were only known for $d=k+1$ and $d=n-1$. The key insight that enables a linear field size in our construction is to reduce the $\binom{n}{r}$ global constraints of non-vanishing determinants to $O_s(n)$ local ones, which is achieved by carefully designing the parity check matrices.
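For a concrete sense of the two sub-packetization levels, the snippet below evaluates $s^{\left\lceil n/s \right\rceil}$ and $s^{\left\lceil n/(s+1)\right\rceil}$ for one arbitrarily chosen parameter set $(n,k)=(14,10)$ and all admissible repair degrees $d$.
\begin{verbatim}
from math import ceil

n, k = 14, 10                                 # example parameters (arbitrary)
for d in range(k + 1, n):                     # admissible repair degrees k+1 <= d <= n-1
    s = d - k + 1
    optimal_access = s ** ceil(n / s)         # smallest sub-packetization for optimal access
    smaller = s ** ceil(n / (s + 1))          # the even smaller level of the second construction
    print(d, s, optimal_access, smaller)
# d=11: 128 vs 32;  d=12: 243 vs 81;  d=13: 256 vs 64
\end{verbatim}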
We consider the Distinct Shortest Walks problem. Given two vertices $s$ and $t$ of a graph database $\mathcal{D}$ and a regular path query, enumerate all walks of minimal length from $s$ to $t$ that carry a label conforming to the query. Usual theoretical solutions turn out to be inefficient when applied to graph models that are closer to real-life systems, in particular because edges may carry multiple labels. Indeed, known algorithms may repeat the same answer exponentially many times. We propose an efficient algorithm for multi-labelled graph databases. The preprocessing runs in $O(|\mathcal{D}|\times|\mathcal{A}|)$ time and the delay between two consecutive outputs is in $O(\lambda\times|\mathcal{A}|)$, where $\mathcal{A}$ is a nondeterministic automaton representing the query and $\lambda$ is the minimal length. The algorithm can handle $\varepsilon$-transitions in $\mathcal{A}$ or queries given as regular expressions at no additional cost.
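A baseline for intuition (not the paper's enumeration algorithm with the stated preprocessing and delay bounds): breadth-first search over the product of the graph database and the query automaton computes the minimal length $\lambda$ of a conforming walk. All identifiers below are illustrative, and multi-labelled edges are modelled as label sets.
\begin{verbatim}
from collections import deque

def shortest_conforming_walk_length(edges, nfa, s, t, q0, accepting):
    """edges: list of (u, v, labels) with labels a set; nfa: dict (state, label) -> set of states."""
    adj = {}
    for u, v, labels in edges:
        adj.setdefault(u, []).append((v, labels))
    frontier, seen = deque([(s, q0, 0)]), {(s, q0)}
    while frontier:
        v, q, dist = frontier.popleft()
        if v == t and q in accepting:
            return dist
        for w, labels in adj.get(v, []):
            for a in labels:                           # multiple labels per edge
                for q2 in nfa.get((q, a), set()):
                    if (w, q2) not in seen:
                        seen.add((w, q2))
                        frontier.append((w, q2, dist + 1))
    return None

edges = [(0, 1, {"a", "b"}), (1, 2, {"a"}), (0, 2, {"b"})]
nfa = {("p", "a"): {"p"}, ("p", "b"): {"r"}}           # query: a* b
print(shortest_conforming_walk_length(edges, nfa, 0, 2, "p", {"r"}))   # -> 1
\end{verbatim}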
We introduce a \emph{gain function} viewpoint of information leakage by proposing \emph{maximal $g$-leakage}, a rich class of operationally meaningful leakage measures that subsumes recently introduced leakage measures -- {maximal leakage} and {maximal $\alpha$-leakage}. In maximal $g$-leakage, the gain of an adversary in guessing an unknown random variable is measured using a {gain function} applied to the probability of correctly guessing. In particular, maximal $g$-leakage captures the multiplicative increase, upon observing $Y$, in the expected gain of an adversary in guessing a randomized function of $X$, maximized over all such randomized functions. We also consider the scenario where an adversary can make multiple attempts to guess the randomized function of interest. We show that maximal leakage is an upper bound on maximal $g$-leakage under multiple guesses, for any non-negative gain function $g$. We obtain a closed-form expression for maximal $g$-leakage under multiple guesses for a class of concave gain functions. We also study the maximal $g$-leakage measure for a specific class of gain functions related to the $\alpha$-loss. In particular, we first completely characterize the minimal expected $\alpha$-loss under multiple guesses and analyze how the corresponding leakage measure is affected by the number of guesses. Finally, we study two variants of maximal $g$-leakage depending on the type of adversary and obtain closed-form expressions for them, which do not depend on the particular gain function considered as long as it satisfies some mild regularity conditions. We do this by developing a variational characterization for the R\'{e}nyi divergence of order infinity, which naturally generalizes the definition of pointwise maximal leakage to incorporate arbitrary gain functions.
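For reference, maximal leakage itself has the closed form $\log \sum_{y} \max_{x} P_{Y|X}(y|x)$, which serves above as an upper bound on maximal $g$-leakage under multiple guesses; the snippet below evaluates it (in nats) for an arbitrary toy channel, without computing any of the $g$-leakage quantities.
\begin{verbatim}
import numpy as np

P_Y_given_X = np.array([[0.7, 0.2, 0.1],   # rows: x, columns: y, each row sums to 1
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4]])

# Maximal leakage: log sum_y max_x P(y|x); equals 0 when all rows are identical.
max_leakage = np.log(P_Y_given_X.max(axis=0).sum())
print(max_leakage)
\end{verbatim}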
In 2023, Kuznetsov and Speranski introduced infinitary action logic with multiplexing $!^m\nabla \mathrm{ACT}_\omega$ and proved that the derivability problem for it lies between the $\omega$ and $\omega^\omega$ levels of the hyperarithmetical hierarchy. We prove that this problem is $\Delta^0_{\omega^\omega}$-complete under Turing reductions. Namely, we prove that it is recursively isomorphic to the satisfaction predicate for computable infinitary formulas of rank less than $\omega^\omega$ in the language of arithmetic. We also prove this result for the fragment of $!^m\nabla \mathrm{ACT}_\omega$ where Kleene star is not allowed to be in the scope of the subexponential. Finally, we present a family of logics, which are fragments of $!^m\nabla \mathrm{ACT}_\omega$, such that the complexity of the $k$-th logic is between $\Delta^0_{\omega^k}$ and $\Delta^0_{\omega^{k+1}}$.
We propose a new framework to design and analyze accelerated methods that solve general monotone equation (ME) problems $F(x)=0$. Traditional approaches include generalized steepest descent methods and inexact Newton-type methods. If $F$ is uniformly monotone and twice differentiable, these methods achieve local convergence rates, while the latter are globally convergent thanks to line search and hyperplane projection. However, no global rate is known for these methods. Variational inequality methods can be applied to yield a global rate expressed in terms of $\|F(x)\|$, but these results are restricted to first-order methods and a Lipschitz continuous operator. It has not been clear how to obtain global acceleration using high-order Lipschitz continuity. This paper takes a continuous-time perspective in which accelerated methods are viewed as discretizations of dynamical systems. Our contribution is to propose accelerated rescaled gradient systems and prove that they are equivalent to closed-loop control systems. Based on this connection, we establish the properties of solution trajectories. Moreover, we provide a unified algorithmic framework obtained from discretization of our system, which together with two approximation subroutines yields both existing high-order methods and new first-order methods. We prove that the $p^{th}$-order method achieves a global rate of $O(k^{-p/2})$ in terms of $\|F(x)\|$ if $F$ is $p^{th}$-order Lipschitz continuous, and that the first-order method achieves the same rate if $F$ is $p^{th}$-order strongly Lipschitz continuous. If $F$ is strongly monotone, the restarted versions achieve local convergence with order $p$ when $p \geq 2$. Our discrete-time analysis is largely motivated by the continuous-time analysis and demonstrates the fundamental role that rescaled gradients play in global acceleration for solving ME problems.
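As a point of comparison only (this is not the paper's accelerated rescaled gradient scheme), the classical extragradient method is the standard first-order baseline for a Lipschitz continuous monotone $F$; the sketch below runs it on a toy affine monotone operator and reports $\|F(x)\|$ at the final iterate. The operator, step size, and iteration count are arbitrary illustrative choices.
\begin{verbatim}
import numpy as np

def extragradient(F, x0, step, iters):
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x_half = x - step * F(x)          # prediction step
        x = x - step * F(x_half)          # correction step at the predicted point
    return x

# Toy monotone (non-gradient-type) operator: F(x) = A x + b with A + A^T >= 0.
A = np.array([[1.0, 2.0], [-2.0, 1.0]])
b = np.array([1.0, -1.0])
F = lambda x: A @ x + b
x = extragradient(F, x0=[0.0, 0.0], step=0.2, iters=200)
print(np.linalg.norm(F(x)))               # should be close to 0
\end{verbatim}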
Neural machine translation (NMT) is a deep learning based approach to machine translation, which yields state-of-the-art translation performance in scenarios where large-scale parallel corpora are available. Although high-quality, domain-specific translation is crucial in the real world, domain-specific corpora are usually scarce or nonexistent, and thus vanilla NMT performs poorly in such scenarios. Domain adaptation, which leverages both out-of-domain parallel corpora and monolingual corpora for in-domain translation, is therefore very important for domain-specific translation. In this paper, we give a comprehensive survey of the state-of-the-art domain adaptation techniques for NMT.