This article studies the convergence properties of trans-dimensional MCMC algorithms when the total number of models is finite. It is shown that, for reversible and some non-reversible trans-dimensional Markov chains, under mild conditions, geometric convergence is guaranteed if the Markov chains associated with the within-model moves are geometrically ergodic. This result is proved in an $L^2$ framework using the technique of Markov chain decomposition. While the technique was previously developed for reversible chains, this work extends it to the point that it can be applied to some commonly used non-reversible chains. Under geometric convergence, a central limit theorem holds for ergodic averages, even in the absence of Harris ergodicity. This allows for the construction of simultaneous confidence intervals for features of the target distribution. This procedure is rigorously examined in a trans-dimensional setting, and special attention is given to the case where the asymptotic covariance matrix in the central limit theorem is singular. The theory and methodology herein are applied to reversible jump algorithms for two Bayesian models: a robust autoregression with unknown model order, and a probit regression with variable selection.
Large language models (LLMs) based on transformers have made significant strides in recent years, a success driven by scaling up their model size. Despite their high algorithmic performance, LLMs present unprecedented challenges in their computational and memory requirements. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced, which can scale its model size without proportionally scaling up its computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because migrating activated experts from CPU to GPU incurs high latency and therefore high performance overhead. Our proposed Pre-gated MoE system effectively tackles the compute and memory challenges of conventional MoE architectures using our algorithm-system co-design. Pre-gated MoE employs our novel pre-gating function, which alleviates the dynamic nature of sparse expert activation, allowing our proposed system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs using just a single GPU with high performance.
This paper studies the extreme singular values of non-harmonic Fourier matrices. Such a matrix of size $m\times s$ can be written as $\Phi=[ e^{-2\pi i j x_k}]_{j=0,1,\dots,m-1, k=1,2,\dots,s}$ for some set $\mathcal{X}=\{x_k\}_{k=1}^s$. The main results provide explicit lower bounds for the smallest singular value of $\Phi$ under the assumption $m\geq 6s$ and without any restrictions on $\mathcal{X}$. They show that for an appropriate scale $\tau$ determined by a density criterion, interactions between elements in $\mathcal{X}$ at scales smaller than $\tau$ are most significant and depend on the multiscale structure of $\mathcal{X}$ at fine scales, while distances larger than $\tau$ are less important and only depend on the local sparsity of the far away points. Theoretical and numerical comparisons show that the main results significantly improve upon classical bounds and achieve the same rate that was previously discovered for more restrictive settings.
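As a small numerical illustration of the definition above, the following sketch builds $\Phi$ for a given node set and computes its smallest singular value with NumPy. The function names and the two example node sets are hypothetical and chosen only for illustration; they are not taken from the paper, and the sketch does not implement the paper's lower bounds.

```python
import numpy as np

def fourier_matrix(x, m):
    """Return the m-by-s matrix with entries exp(-2*pi*i*j*x_k), j = 0..m-1."""
    x = np.asarray(x, dtype=float)
    j = np.arange(m)[:, None]            # row index j as a column vector
    return np.exp(-2j * np.pi * j * x[None, :])

def smallest_singular_value(x, m):
    """Smallest singular value of the non-harmonic Fourier matrix for nodes x."""
    # np.linalg.svd returns singular values in descending order.
    return np.linalg.svd(fourier_matrix(x, m), compute_uv=False)[-1]

# Hypothetical node sets with s = 4 and m = 6s, matching the assumption m >= 6s.
s = 4
m = 6 * s
well_separated = np.arange(s) / s                  # equispaced nodes
clustered = np.array([0.0, 0.005, 0.5, 0.505])     # two tight pairs

sigma_sep = smallest_singular_value(well_separated, m)
sigma_clustered = smallest_singular_value(clustered, m)
print(sigma_sep, sigma_clustered)
```

For the equispaced nodes the columns of $\Phi$ are orthogonal, so every singular value equals $\sqrt{m}$; the clustered node set gives a markedly smaller value, which is the regime the bounds are meant to quantify.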
For an infinite class of finite graphs of unbounded size, we define a limit object, called the wide limit, relative to some computationally restricted class of functions. The properties of the wide limit then reflect how a computationally restricted viewer "sees" a generic instance from the class. The construction uses arithmetic forcing with random variables [10]. We prove sufficient conditions for universal and existential sentences to be valid in the limit, provide several examples, and prove that such a limit object can then be expanded to a model of weak arithmetic. We then take the wide limit of all finite pointed paths to obtain a model of arithmetic where the problem OntoWeakPigeon is total but Leaf (the complete problem for $\textbf{PPA}$) is not. This logical separation of the oracle classes of total NP search problems in our setting implies that Leaf is not reducible to OntoWeakPigeon even if some errors are allowed in the reductions.
We present here a new splitting method to solve Lyapunov equations in a Kronecker product form. Although the resulting matrix is of order $n^2$, each iteration demands only two operations with the matrix $A$: a multiplication of the form $(A-\sigma I) \tilde{B}$ and an inversion of the form $(A-\sigma I)^{-1}\tilde{B}$. We show that, for some choice of a parameter, all eigenvalues of the iteration matrix are less than 1 in absolute value. Moreover, we present a theorem that enables us to get a good starting vector for the method.
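For orientation, the Kronecker product form can be written out under the assumption of the standard continuous-time Lyapunov equation with an $n\times n$ matrix $A$. The abstract does not state the exact form, sign convention, or right-hand side used, so the symbols $X$ and $C$ below are illustrative only.

```latex
% Assumed standard form; X and C are illustrative symbols, not from the source.
\begin{align*}
  A X + X A^{T} &= C, \\
  \operatorname{vec}(A X + X A^{T})
    &= (I \otimes A + A \otimes I)\,\operatorname{vec}(X), \\
  (I \otimes A + A \otimes I)\,\operatorname{vec}(X) &= \operatorname{vec}(C).
\end{align*}
```

The second line follows from the identity $\operatorname{vec}(AXB) = (B^{T} \otimes A)\operatorname{vec}(X)$. The coefficient matrix $I \otimes A + A \otimes I$ has order $n^2$, which is why a method that only needs operations with the $n \times n$ matrix $A$ is attractive.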
Neuromorphic computing is one of the few current approaches that have the potential to significantly reduce power consumption in Machine Learning and Artificial Intelligence. Imam & Cleland presented an odour-learning algorithm that runs on a neuromorphic architecture and is inspired by circuits described in the mammalian olfactory bulb. They assess the algorithm's performance in "rapid online learning and identification" of gaseous odorants and odorless gases (short "gases") using a set of gas sensor recordings of different odour presentations and corrupting them by impulse noise. We replicated parts of the study and discovered limitations that affect some of the conclusions drawn. First, the dataset used suffers from sensor drift and a non-randomised measurement protocol, rendering it of limited use for odour identification benchmarks. Second, we found that the model is restricted in its ability to generalise over repeated presentations of the same gas. We demonstrate that the task the study refers to can be solved with a simple hash table approach, matching or exceeding the reported results in accuracy and runtime. Therefore, a validation of the model that goes beyond restoring a learned data sample remains to be shown, in particular its suitability to odour identification tasks.
We make two contributions to the Isolation Forest method for anomaly and outlier detection. The first contribution is an information-theoretically motivated generalisation of the score function that is used to aggregate the scores across random tree estimators. This generalisation allows one to take into account not just the ensemble average across trees but the whole distribution. The second contribution is an alternative scoring function at the level of the individual tree estimator, in which we replace the depth-based scoring of the Isolation Forest with one based on hyper-volumes associated with an isolation tree's leaf nodes. We motivate the use of both of these methods on generated data and also evaluate them on 34 datasets from the recent and exhaustive ``ADBench'' benchmark, finding significant improvement over the standard isolation forest for both variants on some datasets and improvement on average across all datasets for one of the two variants. The code to reproduce our results is made available as part of the submission.
Many protocols in distributed computing rely on a source of randomness, usually called a random beacon, both for their applicability and security. This is especially true for proof-of-stake blockchain protocols in which the next miner or set of miners has to be chosen randomly and each party's likelihood of being selected is in proportion to their stake in the cryptocurrency. Current random beacons used in proof-of-stake protocols, such as Ouroboros and Algorand, have two fundamental limitations: either (i)~they rely on pseudorandomness, e.g.~assuming that the output of a hash function is uniform, which is a widely-used but unproven assumption, or (ii)~they generate their randomness using a distributed protocol in which several participants are required to submit random numbers which are then used in the generation of a final random result. However, in the latter case, there is no guarantee that the numbers provided by the parties are uniformly random and there is no incentive for the parties to honestly generate uniform randomness. Most random beacons have both limitations. In this thesis, we provide a protocol for distributed generation of randomness. Our protocol does not rely on pseudorandomness at all. Similar to some of the previous approaches, it uses random inputs from different participants to generate a final random result. However, the crucial difference is that we provide a game-theoretic guarantee showing that it is in everyone's best interest to submit uniform random numbers. Hence, our approach is the first to incentivize honest behavior instead of just assuming it. Moreover, the approach is trustless and generates unbiased random numbers. It is also tamper-proof: no party can change the output or affect its distribution. Finally, it is designed with modularity in mind and can be easily plugged into existing distributed protocols such as proof-of-stake blockchains.
The study further explores randomized quasi-Monte Carlo (RQMC), which maintains the QMC convergence rate and facilitates computational efficiency analysis. Emphasis is laid on integrating randomly shifted lattice rules, a distinct RQMC quadrature, with importance sampling (IS), a classic variance reduction technique. The study underscores the intricacies of establishing a theoretical convergence rate for IS in QMC compared to Monte Carlo (MC), given the influence of problem dimensions and smoothness on QMC. The research also touches on the significance of IS density selection and its potential implications. The study culminates in examining the error bound of IS with a randomly shifted lattice rule, drawing inspiration from the reproducing kernel Hilbert space (RKHS). In the realm of finance and statistics, many problems boil down to computing expectations, predominantly integrals with respect to a Gaussian measure. This study considers optimal drift importance sampling (ODIS) and Laplace importance sampling (LapIS) as common importance densities. Finally, the paper establishes that under certain conditions, the IS-randomly shifted lattice rule can achieve a near $O(N^{-1})$ error bound.
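To make the notion of a randomly shifted lattice rule concrete, here is a minimal sketch of the quadrature on its own, without the importance sampling step studied in the paper. It assumes a two-dimensional rank-1 Fibonacci lattice ($N = 89$, generating vector $(1, 55)$) and a toy periodic integrand. The function names, the integrand, and the number of shifts are hypothetical choices for illustration, not the paper's setup.

```python
import numpy as np

def shifted_lattice_estimate(f, n, z, num_shifts, rng):
    """Average f over num_shifts independently shifted copies of a rank-1 lattice."""
    z = np.asarray(z)
    i = np.arange(n)[:, None]
    base = (i * z[None, :] / n) % 1.0      # rank-1 lattice points in [0, 1)^d
    estimates = []
    for _ in range(num_shifts):
        shift = rng.random(z.size)          # one uniform shift per replicate
        points = (base + shift) % 1.0       # shift modulo 1 keeps points in the cube
        estimates.append(f(points).mean())
    estimates = np.array(estimates)
    # The spread across independent shifts gives a practical error estimate.
    std_err = estimates.std(ddof=1) / np.sqrt(num_shifts)
    return estimates.mean(), std_err

def integrand(p):
    """Toy smooth periodic integrand on [0, 1)^2 whose exact integral is 1."""
    return (1 + np.sin(2 * np.pi * p[:, 0])) * (1 + np.sin(2 * np.pi * p[:, 1]))

rng = np.random.default_rng(0)
estimate, std_err = shifted_lattice_estimate(
    integrand, n=89, z=(1, 55), num_shifts=8, rng=rng
)
print(estimate, std_err)
```

Each shifted point set is an unbiased estimator of the integral, so averaging over shifts keeps the lattice structure while providing a variance estimate. The toy integrand is a low-degree trigonometric polynomial that this lattice integrates exactly up to rounding, so it only demonstrates the mechanics, not the convergence rate.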
We propose an approach to compute inner- and outer-approximations of the sets of values satisfying constraints expressed as arbitrarily quantified formulas. Such formulas arise, for instance, when specifying important problems in control such as robustness, motion planning, or controller comparison. We propose an interval-based method which allows for tractable but tight approximations. We demonstrate its applicability through a series of examples and benchmarks using a prototype implementation.
Contrastive loss has been increasingly used in learning representations from multiple modalities. In the limit, the nature of the contrastive loss encourages modalities to exactly match each other in the latent space. Yet it remains an open question how the modality alignment affects the downstream task performance. In this paper, based on an information-theoretic argument, we first prove that exact modality alignment is sub-optimal in general for downstream prediction tasks. Hence we advocate that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment. To this end, we propose three general approaches to construct latent modality structures. Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization. Extensive experiments are conducted on two popular multi-modal representation learning frameworks: the CLIP-based two-tower model and the ALBEF-based fusion model. We test our model on a variety of tasks including zero/few-shot image classification, image-text retrieval, visual question answering, visual reasoning, and visual entailment. Our method achieves consistent improvements over existing methods, demonstrating the effectiveness and generalizability of our proposed approach on latent modality structure regularization.