We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss. In the lazy training setting of Chizat et al. 2019, we show that rescaling the model by a factor of $\alpha = O(T)$ suffices for the NTK approximation to be valid until training time $T$. Our bound is tight and improves on the previous bound of Chizat et al. 2019, which required a larger rescaling factor of $\alpha = O(T^2)$.
Scalarization is a general technique that can be deployed in any multiobjective setting to reduce multiple objectives into one, such as recently in RLHF for training reward models that align human preferences. Yet some have dismissed this classical approach because linear scalarizations are known to miss concave regions of the Pareto frontier. To that end, we aim to find simple non-linear scalarizations that can explore a diverse set of $k$ objectives on the Pareto frontier, as measured by the dominated hypervolume. We show that hypervolume scalarizations with uniformly random weights are surprisingly optimal for provably minimizing the hypervolume regret, achieving an optimal sublinear regret bound of $O(T^{-1/k})$, with matching lower bounds that preclude any algorithm from doing better asymptotically. As a theoretical case study, we consider the multiobjective stochastic linear bandits problem and demonstrate that by exploiting the sublinear regret bounds of the hypervolume scalarizations, we can derive a novel non-Euclidean analysis that produces improved hypervolume regret bounds of $\tilde{O}( d T^{-1/2} + T^{-1/k})$. We support our theory with strong empirical performance of using simple hypervolume scalarizations that consistently outperforms both the linear and Chebyshev scalarizations, as well as standard multiobjective algorithms in bayesian optimization, such as EHVI.
For any two point sets $A,B \subset \mathbb{R}^d$ of size up to $n$, the Chamfer distance from $A$ to $B$ is defined as $\text{CH}(A,B)=\sum_{a \in A} \min_{b \in B} d_X(a,b)$, where $d_X$ is the underlying distance measure (e.g., the Euclidean or Manhattan distance). The Chamfer distance is a popular measure of dissimilarity between point clouds, used in many machine learning, computer vision, and graphics applications, and admits a straightforward $O(d n^2)$-time brute force algorithm. Further, the Chamfer distance is often used as a proxy for the more computationally demanding Earth-Mover (Optimal Transport) Distance. However, the \emph{quadratic} dependence on $n$ in the running time makes the naive approach intractable for large datasets. We overcome this bottleneck and present the first $(1+\epsilon)$-approximate algorithm for estimating the Chamfer distance with a near-linear running time. Specifically, our algorithm runs in time $O(nd \log (n)/\varepsilon^2)$ and is implementable. Our experiments demonstrate that it is both accurate and fast on large high-dimensional datasets. We believe that our algorithm will open new avenues for analyzing large high-dimensional point clouds. We also give evidence that if the goal is to \emph{report} a $(1+\varepsilon)$-approximate mapping from $A$ to $B$ (as opposed to just its value), then any sub-quadratic time algorithm is unlikely to exist.
In this paper, we construct an efficient linear and fully decoupled finite difference scheme for wormhole propagation with heat transmission process on staggered grids, which only requires solving a sequence of linear elliptic equations at each time step. We first derive the positivity preserving properties for the discrete porosity and its difference quotient in time, and then obtain optimal error estimates for the velocity, pressure, concentration, porosity and temperature in different norms rigorously and carefully by establishing several auxiliary lemmas for the highly coupled nonlinear system. Numerical experiments in two- and three-dimensional cases are provided to verify our theoretical results and illustrate the capabilities of the constructed method.
In 1982 Papadimitriou and Yannakakis introduced the Exact Matching problem, in which given a red and blue edge-colored graph $G$ and an integer $k$ one has to decide whether there exists a perfect matching in $G$ with exactly $k$ red edges. Even though a randomized polynomial-time algorithm for this problem was quickly found a few years later, it is still unknown today whether a deterministic polynomial-time algorithm exists. This makes the Exact Matching problem an important candidate to test the RP=P hypothesis. In this paper we focus on approximating Exact Matching. While there exists a simple algorithm that computes in deterministic polynomial-time an almost perfect matching with exactly $k$ red edges, not a lot of work focuses on computing perfect matchings with almost $k$ red edges. In fact such an algorithm for bipartite graphs running in deterministic polynomial-time was published only recently (STACS'23). It outputs a perfect matching with $k'$ red edges with the guarantee that $0.5k \leq k' \leq 1.5k$. In the present paper we aim at approximating the number of red edges without exceeding the limit of $k$ red edges. We construct a deterministic polynomial-time algorithm, which on bipartite graphs computes a perfect matching with $k'$ red edges such that $k/3 \leq k' \leq k$.
This paper examines the approximation of log-determinant for large-scale symmetric positive definite matrices. Inspired by the variance reduction technique, we split the approximation of $\log\det(A)$ into two parts. The first to compute is the trace of the projection of $\log(A)$ onto a suboptimal subspace, while the second is the trace of the projection on the corresponding orthogonal complementary space. For these two approximations, the stochastic Lanczos quadrature method is used. Furthermore, in the construction of the suboptimal subspace, we utilize a projection-cost-preserving sketch to bound the size of the Gaussian random matrix and the dimension of the suboptimal subspace. We provide a rigorous error analysis for our proposed method and explicit lower bounds for its design parameters, offering guidance for practitioners. We conduct numerical experiments to demonstrate our method's effectiveness and illustrate the quality of the derived bounds.
The Independent Cutset problem asks whether there is a set of vertices in a given graph that is both independent and a cutset. Such a problem is $\textsf{NP}$-complete even when the input graph is planar and has maximum degree five. In this paper, we first present a $\mathcal{O}^*(1.4423^{n})$-time algorithm for the problem. We also show how to compute a minimum independent cutset (if any) in the same running time. Since the property of having an independent cutset is MSO$_1$-expressible, our main results are concerned with structural parameterizations for the problem considering parameters that are not bounded by a function of the clique-width of the input. We present $\textsf{FPT}$-time algorithms for the problem considering the following parameters: the dual of the maximum degree, the dual of the solution size, the size of a dominating set (where a dominating set is given as an additional input), the size of an odd cycle transversal, the distance to chordal graphs, and the distance to $P_5$-free graphs. We close by introducing the notion of $\alpha$-domination, which allows us to identify more fixed-parameter tractable and polynomial-time solvable cases.
We consider the problem of empirical Bayes estimation for (multivariate) Poisson means. Existing solutions that have been shown theoretically optimal for minimizing the regret (excess risk over the Bayesian oracle that knows the prior) have several shortcomings. For example, the classical Robbins estimator does not retain the monotonicity property of the Bayes estimator and performs poorly under moderate sample size. Estimators based on the minimum distance and non-parametric maximum likelihood (NPMLE) methods correct these issues, but are computationally expensive with complexity growing exponentially with dimension. Extending the approach of Barbehenn and Zhao (2022), in this work we construct monotone estimators based on empirical risk minimization (ERM) that retain similar theoretical guarantees and can be computed much more efficiently. Adapting the idea of offset Rademacher complexity Liang et al. (2015) to the non-standard loss and function class in empirical Bayes, we show that the shape-constrained ERM estimator attains the minimax regret within constant factors in one dimension and within logarithmic factors in multiple dimensions.
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has a quadratic complexity in both computation and memory usage. This motivates the development of approximating the self-attention at linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximations, that is, normalizing the scaled dot-product between token feature vectors using the softmax function. As preserving the softmax operation challenges any subsequent linearization efforts. By this insight, a family of Softmax-Free Transformers (SOFT) are proposed. Specifically, a Gaussian kernel function is adopted to replace the dot-product similarity, enabling a full self-attention matrix to be approximated under low-rank matrix decomposition. For computational robustness, we estimate the Moore-Penrose inverse using an iterative Newton-Raphson method in the forward process only, while calculating its theoretical gradients only once in the backward process. To further expand applicability (e.g., dense prediction tasks), an efficient symmetric normalization technique is introduced. Extensive experiments on ImageNet, COCO, and ADE20K show that our SOFT significantly improves the computational efficiency of existing ViT variants. With linear complexity, much longer token sequences are permitted by SOFT, resulting in superior trade-off between accuracy and complexity. Code and models are available at //github.com/fudan-zvg/SOFT.
The goal of this work is to study waves interacting with partially immersed objects allowed to move freely in the vertical direction, and in a regime in which the propagation of the waves is described by the one dimensional Boussinesq-Abbott system. The problem can be reduced to a transmission problem for this Boussinesq system, in which the transmission conditions between the components of the domain at the left and at the right of the object are determined through the resolution of coupled forced ODEs in time satisfied by the vertical displacement of the object and the average discharge in the portion of the fluid located under the object. We propose a new extended formulation in which these ODEs are complemented by two other forced ODEs satisfied by the trace of the surface elevation at the contact points. The interest of this new extended formulation is that the forcing terms are easy to compute numerically and that the surface elevation at the contact points is furnished for free. Based on this formulation, we propose a second order scheme that involves a generalization of the MacCormack scheme with nonlocal flux and a source term, which is coupled to a second order Heun scheme for the ODEs. In order to validate this scheme, several explicit solutions for this wave-structure interaction problem are derived and can serve as benchmark for future codes. As a byproduct, our method provides a second order scheme for the generation of waves at the entrance of the numerical domain for the Boussinesq-Abbott system.
Mutual coherence is a measure of similarity between two opinions. Although the notion comes from philosophy, it is essential for a wide range of technologies, e.g., the Wahl-O-Mat system. In Germany, this system helps voters to find candidates that are the closest to their political preferences. The exact computation of mutual coherence is highly time-consuming due to the iteration over all subsets of an opinion. Moreover, for every subset, an instance of the SAT model counting problem has to be solved which is known to be a hard problem in computer science. This work is the first study to accelerate this computation. We model the distribution of the so-called confirmation values as a mixture of three Gaussians and present efficient heuristics to estimate its model parameters. The mutual coherence is then approximated with the expected value of the distribution. Some of the presented algorithms are fully polynomial-time, others only require solving a small number of instances of the SAT model counting problem. The average squared error of our best algorithm lies below 0.0035 which is insignificant if the efficiency is taken into account. Furthermore, the accuracy is precise enough to be used in Wahl-O-Mat-like systems.