Multiple-try Metropolis (MTM) is a popular Markov chain Monte Carlo method with the appealing feature of being amenable to parallel computing. At each iteration, it samples several candidates for the next state of the Markov chain and randomly selects one of them based on a weight function. The canonical weight function is proportional to the target density. We show both theoretically and empirically that this weight function induces pathological behaviours in high dimensions, especially during the convergence phase. We propose to instead use weight functions akin to the locally-balanced proposal distributions of Zanella (2020), thus yielding MTM algorithms that do not exhibit those pathological behaviours. To theoretically analyse these algorithms, we study the high-dimensional performance of ideal schemes that can be thought of as MTM algorithms which sample an infinite number of candidates at each iteration, as well as the discrepancy between such schemes and the MTM algorithms which sample a finite number of candidates. Our analysis unveils a strong distinction between the convergence and stationary phases: in the former, local balancing is crucial and effective to achieve fast convergence, while in the latter, the canonical and novel weight functions yield similar performance. Numerical experiments include an application in precision medicine involving a computationally expensive forward model, which makes the use of parallel computing within MTM iterations beneficial.
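To make the role of the weight function concrete, here is a minimal Python sketch (illustrative only; the proposal, settings, and names are our own assumptions, not the paper's exact scheme) of one MTM iteration with a symmetric Gaussian proposal and either the canonical weight w(y|x) = pi(y) or a locally-balanced square-root weight w(y|x) = sqrt(pi(y)/pi(x)). With a symmetric proposal, both choices lead to the same simplified acceptance ratio: the sum of forward weights over the sum of reverse weights.

```python
import numpy as np

def logsumexp(a):
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def mtm_step(x, log_pi, n_tries, sigma, weight, rng):
    d = x.size

    def log_w(y, frm):                       # log weight of a candidate y proposed from state frm
        if weight == "canonical":            # w(y|x) proportional to pi(y)
            return log_pi(y)
        return 0.5 * (log_pi(y) - log_pi(frm))  # locally-balanced "sqrt" weight

    # 1) draw candidates and 2) select one with probability proportional to its weight
    ys = x + sigma * rng.standard_normal((n_tries, d))
    log_wy = np.array([log_w(y, x) for y in ys])
    probs = np.exp(log_wy - log_wy.max()); probs /= probs.sum()
    y = ys[rng.choice(n_tries, p=probs)]

    # 3) "shadow" candidates drawn from the selected point; the last one is the current state
    xs = np.vstack([y + sigma * rng.standard_normal((n_tries - 1, d)), x])
    log_wx = np.array([log_w(z, y) for z in xs])

    # 4) accept/reject; with a symmetric proposal both weight choices above give this ratio
    log_ratio = logsumexp(log_wy) - logsumexp(log_wx)
    return y if rng.uniform() < np.exp(min(0.0, log_ratio)) else x

rng = np.random.default_rng(0)
log_pi = lambda z: -0.5 * np.sum(z ** 2)     # standard Gaussian target
x = np.full(10, 5.0)                         # start far from the mode (convergence phase)
for _ in range(200):
    x = mtm_step(x, log_pi, n_tries=20, sigma=0.5, weight="sqrt", rng=rng)
```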
There has been significant progress in the study of sampling discretization of integral norms for both a designated finite-dimensional function space and a finite collection of such function spaces (universal discretization). Sampling discretization results turn out to be very useful in various applications, particularly in sampling recovery. Recent sampling discretization results typically provide existence of good sampling points for discretization. In this paper, we show that independent and identically distributed random points provide good universal discretization with high probability. Furthermore, we demonstrate that a simple greedy algorithm based on those points that are good for universal discretization provides excellent sparse recovery results in the square norm.
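The abstract does not spell out the greedy algorithm; as a generic stand-in for greedy sparse recovery from sampled values, the sketch below implements a simple orthogonal-matching-pursuit-style loop (our own illustration with an arbitrary random sampling matrix, not necessarily the scheme analysed in the paper).

```python
import numpy as np

def greedy_sparse_recovery(Phi, y, sparsity):
    """At each step pick the dictionary column most correlated with the current
    residual, then refit the selected coefficients by least squares."""
    residual, support = y.copy(), []
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x = np.zeros(Phi.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(1)
Phi = rng.standard_normal((80, 200)) / np.sqrt(80)      # random sampling/dictionary matrix
x_true = np.zeros(200)
x_true[[3, 57, 120]] = [1.0, -2.0, 0.5]
x_hat = greedy_sparse_recovery(Phi, Phi @ x_true, sparsity=3)
print(np.linalg.norm(x_hat - x_true))                   # small recovery error in the square norm
```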
Evaluating the predictive performance of a statistical model is commonly done using cross-validation. Although the leave-one-out method is frequently employed, its application is justified primarily for independent and identically distributed observations; with dependent observations, it tends to mimic interpolation rather than prediction. This paper proposes a modified cross-validation procedure for dependent observations, achieved by excluding an automatically determined set of observations from the training set so as to mimic a more realistic prediction scenario. In addition, within the framework of latent Gaussian models, we illustrate how to adjust the joint posterior for this modified cross-validation so that model refitting is avoided. The new approach is available in the R-INLA package (www.r-inla.org).
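A rough illustration of the underlying idea, leaving aside the automatic determination of the excluded set and the latent Gaussian model machinery: for each held-out observation, nearby observations are also removed from the training set (here with a fixed, purely hypothetical radius) so that the task resembles prediction rather than interpolation.

```python
import numpy as np

def dependence_aware_folds(n, radius):
    """For each held-out index i, also drop every observation within `radius` of i
    from the training set, so the task is prediction rather than interpolation.
    (The paper determines the excluded set automatically; the fixed radius here
    is purely illustrative.)"""
    for i in range(n):
        yield i, np.array([j for j in range(n) if abs(j - i) > radius])

# toy dependent series and a deliberately trivial "model": predict with the training mean
rng = np.random.default_rng(2)
y = np.cumsum(rng.standard_normal(100)) * 0.1
errors = [y[i] - y[train].mean() for i, train in dependence_aware_folds(len(y), radius=5)]
print(np.mean(np.square(errors)))
```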
The quadratic complexity of the attention module means that it gradually becomes the bulk of the compute in Transformer-based LLMs during generation. Moreover, the excessively large key-value cache that arises when dealing with long inputs also creates severe issues in memory footprint and inference latency. In this work, we propose a plug-and-play approach that incrementally compresses the intermediate activations of a specified span of tokens into compact ones, thereby reducing both memory and computational cost when processing subsequent context. Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach over sparse attention baselines in terms of fluency, n-gram matching, and semantic similarity. Finally, we comprehensively profile the benefit of context compression for improving system throughput. Code is available at //github.com/DRSY/KV_Compression.
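As a rough, purely illustrative picture of where such compression acts (the paper's actual compression is learned; the mean pooling below is only a placeholder), one can think of collapsing the cached key/value vectors of a token span into a handful of compact slots:

```python
import numpy as np

def compress_kv_span(keys, values, start, end, n_slots):
    """Replace the cached keys/values of tokens [start, end) with `n_slots`
    pooled vectors (mean pooling over equal chunks; a learned compressor would
    go here in a real system)."""
    chunks = np.array_split(np.arange(start, end), n_slots)
    pooled_k = np.stack([keys[c].mean(axis=0) for c in chunks])
    pooled_v = np.stack([values[c].mean(axis=0) for c in chunks])
    keys = np.concatenate([keys[:start], pooled_k, keys[end:]])
    values = np.concatenate([values[:start], pooled_v, values[end:]])
    return keys, values

# toy cache: 512 cached tokens, head dimension 64
rng = np.random.default_rng(3)
K, V = rng.standard_normal((512, 64)), rng.standard_normal((512, 64))
K, V = compress_kv_span(K, V, start=0, end=256, n_slots=16)
print(K.shape)   # (272, 64): the first 256 tokens now occupy 16 compact slots
```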
We introduce a general differentiable solver for time-dependent deformation problems with contact and friction. Our approach uses a finite element discretization with a high-order time integrator, coupled with the recently proposed incremental potential contact method for handling contact and friction forces, to solve PDE- and ODE-constrained optimization problems on scenes with complex geometry. It supports static and dynamic problems and differentiation with respect to all physical parameters involved in the problem description, including shape, material parameters, friction parameters, and initial conditions. Our analytically derived adjoint formulation is efficient, with a small overhead (typically less than 10% for nonlinear problems) over the forward simulation, and shares many similarities with the forward problem, allowing the reuse of large parts of existing forward simulator code. We implement our approach on top of the open-source PolyFEM library and demonstrate the applicability of our solver to shape design, initial condition optimization, and material estimation, both on simulated results and in physical validations.
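The adjoint idea behind the low overhead can be illustrated on a toy, unrelated problem (a scalar explicit-Euler model, not the paper's finite element solver): one forward time-stepping pass stores the trajectory, and a single backward sweep reuses it to accumulate the gradient with respect to a parameter.

```python
import numpy as np

def forward(theta, x0, dt, n_steps):
    """Explicit Euler for dx/dt = -theta * x; the trajectory is stored for the adjoint sweep."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] * (1.0 - dt * theta))
    return np.array(xs)

def loss_and_grad(theta, x0=2.0, dt=0.01, n_steps=100):
    xs = forward(theta, x0, dt, n_steps)
    loss = 0.5 * xs[-1] ** 2
    lam = xs[-1]                        # adjoint variable at the final time: dL/dx_N
    grad = 0.0
    for k in reversed(range(n_steps)):  # single backward sweep reusing the stored states
        grad += lam * (-dt * xs[k])     # contribution of d x_{k+1} / d theta
        lam *= (1.0 - dt * theta)       # propagate through d x_{k+1} / d x_k
    return loss, grad

loss, grad = loss_and_grad(0.5)
eps = 1e-6                              # finite-difference check of the adjoint gradient
fd = (loss_and_grad(0.5 + eps)[0] - loss_and_grad(0.5 - eps)[0]) / (2 * eps)
print(grad, fd)
```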
Time-Aware Shaper (TAS) is a time-triggered scheduling mechanism that ensures bounded latency for time-critical Scheduled Traffic (ST) flows. The Linux kernel implementation (a.k.a. TAPRIO) has limited capabilities due to varying CPU workloads and thus does not offer tight latency bounds for ST flows; moreover, only relatively long cycle times are currently achievable. Other software implementations are limited to simulation studies without physical deployment. In this paper, we present $\mu$TAS, a MicroC-based hardware implementation of TAS on a programmable SmartNIC. $\mu$TAS takes advantage of the parallel-processing architecture of the SmartNIC to configure the scheduling behaviour of its queues at runtime. To demonstrate the effectiveness of $\mu$TAS, we built a Time-Sensitive Networking (TSN) testbed from scratch. It consists of multiple end-hosts capable of generating ST and Best Effort (BE) flows, and TSN switches equipped with SmartNICs running $\mu$TAS. Time synchronization is maintained between the switches and hosts. Our experiments demonstrate that ST flows experience a bounded latency on the order of tens of microseconds.
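For readers unfamiliar with TAS, the following toy Python sketch (unrelated to the MicroC implementation on the SmartNIC; all values are made up) shows the basic gate-control-list idea: within each cycle, only the queues whose gates are currently open may transmit.

```python
# hypothetical gate-control list: (bitmask of open queues, slot duration in ns)
gcl = [(0b001, 300_000), (0b010, 300_000), (0b100, 400_000)]
cycle_ns = sum(duration for _, duration in gcl)

def open_queues(t_ns):
    """Return the bitmask of queues whose gates are open at absolute time t_ns."""
    t = t_ns % cycle_ns
    for mask, duration in gcl:
        if t < duration:
            return mask
        t -= duration

print(bin(open_queues(350_000)))   # -> 0b10: only queue 1 may transmit in this slot
```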
We propose a Monte Carlo method to efficiently find, count, and sample abstract triangulations of a given manifold M. The method is based on a biased random walk through all possible triangulations of M (in the Pachner graph), constructed by combining (bi-stellar) moves with suitably chosen accept/reject probabilities (Metropolis-Hastings). Asymptotically, the method guarantees that triangulations are sampled from a chosen probability distribution. This enables us not only to sample (rare) triangulations of particular interest, but also to estimate the (extremely small) probability of obtaining them when isomorphism types of triangulations are sampled uniformly at random. We implement our general method for surface triangulations and 1-vertex triangulations of 3-manifolds. To showcase its usefulness, we present a number of experiments: (a) we recover asymptotic growth rates for the number of isomorphism types of simplicial triangulations of the 2-dimensional sphere; (b) we experimentally observe that the growth rate for the number of isomorphism types of 1-vertex triangulations of the 3-dimensional sphere appears to be singly exponential in the number of their tetrahedra; and (c) we present experimental evidence that a randomly chosen isomorphism type of a 1-vertex n-tetrahedron 3-sphere triangulation, for n tending to infinity, almost surely exhibits a fixed edge-degree distribution which decays exponentially for large degrees but is non-monotonic for small degrees.
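Manipulating actual triangulations via Pachner moves requires substantial machinery; the sketch below only illustrates the Metropolis-Hastings accept/reject layer on an abstract state graph, with a toy one-dimensional state space standing in for triangulations and a weight biased towards "small" (rare) states.

```python
import math, random

def mh_walk(x0, neighbors, log_weight, n_steps, seed=0):
    """Metropolis-Hastings random walk on a state graph: propose a uniformly chosen
    neighbouring state and accept/reject so that, asymptotically, states are visited
    with probability proportional to exp(log_weight(state))."""
    rng = random.Random(seed)
    x, nx = x0, neighbors(x0)
    for _ in range(n_steps):
        y = rng.choice(nx)
        ny = neighbors(y)
        # correct for unequal numbers of available moves at x and y
        log_alpha = log_weight(y) - log_weight(x) + math.log(len(nx)) - math.log(len(ny))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x, nx = y, ny
        yield x

# toy state space: integers 0..50 standing in for triangulation "sizes"
neighbors = lambda n: [m for m in (n - 1, n + 1) if 0 <= m <= 50]
samples = list(mh_walk(25, neighbors, lambda n: -0.3 * n, n_steps=10_000))
```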
A recent body of work has demonstrated that Transformer embeddings can be linearly decomposed into well-defined sums of factors that can in turn be related to specific network inputs or components. There is, however, still a dearth of work studying whether these mathematical reformulations are empirically meaningful. In the present work, we study representations from machine-translation decoders using two such embedding decomposition methods. Our results indicate that, while decomposition-derived indicators effectively correlate with model performance, variation across different runs suggests a more nuanced take on this question. The high variability of our measurements indicates that geometry reflects model-specific characteristics more than sentence-specific computations, and that similar training conditions do not guarantee similar vector spaces.
Bayesian cross-validation (CV) is a popular method for predictive model assessment that is simple to implement and broadly applicable. A wide range of CV schemes is available for time series applications, including generic leave-one-out (LOO) and K-fold methods, as well as specialized approaches intended to deal with serial dependence such as leave-future-out (LFO), h-block, and hv-block. Existing large-sample results show that both specialized and generic methods are applicable to models of serially-dependent data. However, large sample consistency results overlook the impact of sampling variability on accuracy in finite samples. Moreover, the accuracy of a CV scheme depends on many aspects of the procedure. We show that poor design choices can lead to elevated rates of adverse selection. In this paper, we consider the problem of identifying the regression component of an important class of models of data with serial dependence, autoregressions of order p with q exogenous regressors (ARX(p,q)), under the logarithmic scoring rule. We show that when serial dependence is present, scores computed using the joint (multivariate) density have lower variance and better model selection accuracy than the popular pointwise estimator. In addition, we present a detailed case study of the special case of ARX models with fixed autoregressive structure and variance. For this class, we derive the finite-sample distribution of the CV estimators and the model selection statistic. We conclude with recommendations for practitioners.
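A toy contrast between the two scoring approaches (with a placeholder Gaussian joint predictive distribution over a held-out block, not the ARX derivation of the paper): the joint score evaluates the multivariate predictive density once, whereas the pointwise estimator sums marginal densities and ignores the serial dependence within the block.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# placeholder joint predictive distribution for a block of 5 dependent held-out
# observations: zero mean, AR(1)-style correlation with coefficient 0.8
idx = np.arange(5)
mu = np.zeros(5)
Sigma = 0.8 ** np.abs(np.subtract.outer(idx, idx))
y = multivariate_normal(mu, Sigma).rvs(random_state=np.random.default_rng(4))

# joint (multivariate) log score: a single evaluation of the joint predictive density
joint_score = multivariate_normal(mu, Sigma).logpdf(y)

# pointwise estimator: sum of univariate marginal log densities,
# which ignores the serial dependence between the held-out observations
pointwise_score = norm(mu, np.sqrt(np.diag(Sigma))).logpdf(y).sum()

print(joint_score, pointwise_score)
```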
This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?
Markov chain Monte Carlo (MCMC) is a commonly used method for approximating expectations with respect to probability distributions. Uncertainty assessment for MCMC estimators is essential in practical applications. Moreover, for multivariate functions of a Markov chain, it is important to estimate not only the autocorrelation of each component but also the cross-correlations, in order to better assess sample quality, improve estimates of effective sample size, and use more effective stopping rules. Berg and Song [2022] introduced the moment least squares (momentLS) estimator, a shape-constrained estimator of the autocovariance sequence of a reversible Markov chain, for univariate functions of the chain. Based on this sequence estimator, they proposed an estimator of the asymptotic variance of the sample mean from MCMC samples. In this study, we propose novel autocovariance sequence and asymptotic variance estimators for Markov chain functions with multiple components, based on the univariate momentLS estimators of Berg and Song [2022]. We establish strong consistency of the proposed auto- and cross-covariance sequence and asymptotic variance matrix estimators. We conduct empirical comparisons of our method with other state-of-the-art approaches on simulated and real-data examples, using popular samplers including the random-walk Metropolis sampler and the No-U-Turn sampler from Stan.
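As a point of reference for the quantities being estimated (this is not the momentLS estimator of Berg and Song [2022], merely a naive truncated-sum baseline), the sketch below computes empirical lag-k autocovariance matrices of a multivariate chain and the corresponding asymptotic covariance matrix of the sample mean.

```python
import numpy as np

def autocov(chain, lag):
    """Empirical lag-`lag` autocovariance matrix of an (n, d) array of MCMC draws."""
    x = chain - chain.mean(axis=0)
    n = len(x)
    return x[:n - lag].T @ x[lag:] / n

def asymptotic_variance(chain, max_lag):
    """Truncated-sum estimate Gamma_0 + sum_k (Gamma_k + Gamma_k^T) of the asymptotic
    covariance matrix of the sample mean (a naive baseline, not momentLS)."""
    Sigma = autocov(chain, 0)
    for k in range(1, max_lag + 1):
        G = autocov(chain, k)
        Sigma = Sigma + G + G.T
    return Sigma

# toy bivariate AR(1) chain as a stand-in for MCMC output
rng = np.random.default_rng(5)
x, draws = np.zeros(2), []
for _ in range(20_000):
    x = 0.9 * x + rng.standard_normal(2)
    draws.append(x)
print(asymptotic_variance(np.array(draws), max_lag=200))
```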