We present a novel multilevel Monte Carlo approach for estimating quantities of interest for stochastic partial differential equations (SPDEs). Drawing inspiration from [Giles and Szpruch: Antithetic multilevel Monte Carlo estimation for multi-dimensional SDEs without L\'evy area simulation, Annals of Appl. Prob., 2014], we extend the antithetic Milstein scheme for finite-dimensional stochastic differential equations to Hilbert space-valued SPDEs. Our method has the advantages of both Euler and Milstein discretizations, as it is easy to implement and does not involve intractable L\'evy area terms. Moreover, the antithetic correction in our method leads to the same variance decay in a MLMC algorithm as the standard Milstein method, resulting in significantly lower computational complexity than a corresponding MLMC Euler scheme. Our approach is applicable to a broader range of non-linear diffusion coefficients and does not require any commutative properties. The key component of our MLMC algorithm is a truncated Milstein-type time stepping scheme for SPDEs, which accelerates the rate of variance decay in the MLMC method when combined with an antithetic coupling on the fine scales. We combine the truncated Milstein scheme with appropriate spatial discretizations and noise approximations on all scales to obtain a fully discrete scheme and show that the antithetic coupling does not introduce an additional bias.
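For orientation, the generic MLMC telescoping identity and an antithetic level correction in the style of Giles and Szpruch can be written as below. The notation is assumed here and does not come from the abstract: $P_\ell$ is the level-$\ell$ approximation of the quantity of interest, $P_\ell^{f}$ and $P_\ell^{a}$ are the fine-level sample and its antithetic twin, and $N_\ell$ is the number of samples on level $\ell$.

```latex
\mathbb{E}[P_L] = \mathbb{E}[P_0] + \sum_{\ell=1}^{L} \mathbb{E}\bigl[P_\ell - P_{\ell-1}\bigr],
\qquad
\widehat{Y}_\ell = \frac{1}{N_\ell} \sum_{i=1}^{N_\ell}
  \Bigl( \tfrac{1}{2}\bigl(P_\ell^{f,(i)} + P_\ell^{a,(i)}\bigr) - P_{\ell-1}^{(i)} \Bigr).
```

If $P_\ell^{a}$ has the same distribution as $P_\ell^{f}$, then $\mathbb{E}[\widehat{Y}_\ell] = \mathbb{E}[P_\ell - P_{\ell-1}]$, so averaging the two fine-level samples leaves the telescoping sum unchanged.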
Probabilistic diffusion models enjoy increasing popularity in the deep learning community. They generate convincing samples from a learned distribution of input images and have a wide range of practical applications. Originally, these approaches were motivated by drift-diffusion processes, but these origins receive less attention in recent, practice-oriented publications. We investigate probabilistic diffusion models from the viewpoint of scale-space research and show that they fulfil generalised scale-space properties on evolving probability distributions. Moreover, we discuss similarities and differences between interpretations of the physical core concept of drift-diffusion in the deep learning and model-based world. To this end, we examine relations of probabilistic diffusion to osmosis filters.
We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE) and sequence-to-sequence (Seq2Seq) architecture, leading to improvements in both efficiency and performance. We further decrease the memory occupation of decoding by incorporating input feature fusion and then employ a multi-head attention mechanism to capture features at different levels. NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, which signifies a relative improvement of 49% over the official baseline system, and is the key technique that allowed us to achieve the best performance in the main track of the CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module (DIM) in the MA-MSE module to retrieve a cleaner and more discriminative multi-speaker embedding, enabling the current model to outperform the system we used in the CHiME-7 DASR Challenge. Our code will be available at //github.com/liyunlongaaa/NSD-MS2S.
There has been enormous interest in analysing and modelling periodic time series. Periodically integrated autoregressive (PIAR) models, which capture the periodic structure and the presence of unit roots, are widely applied in environmental, financial and energy areas. In this paper, we propose a multi-companion method which uses the eigen information of the multi-companion matrix in the multi-companion representation of PIAR models. The method enables the estimation and forecasting of PIAR models with one, two or multiple unit roots. We show that the parameters of PIAR models can be represented in terms of the eigen information of the multi-companion matrix. Consequently, the estimation can be conducted using the eigen information, rather than directly estimating the parameters of PIAR models. A Monte Carlo experiment and an application are provided to illustrate the robustness and effectiveness of the multi-companion method.
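To illustrate the link between eigen information and unit roots, the sketch below uses the ordinary companion matrix of a non-periodic AR(p) model. This is a deliberate simplification and not the multi-companion representation of the abstract. The function names `companion_matrix` and `count_unit_roots` and the tolerance `tol` are hypothetical, and only eigenvalues equal to +1 are counted.

```python
import numpy as np


def companion_matrix(phi):
    """Companion matrix of y_t = phi[0]*y_{t-1} + ... + phi[p-1]*y_{t-p}."""
    phi = np.asarray(phi, dtype=float)
    p = phi.size
    mat = np.zeros((p, p))
    mat[0, :] = phi                    # first row holds the AR coefficients
    mat[1:, :-1] = np.eye(p - 1)       # sub-diagonal shifts the lagged states
    return mat


def count_unit_roots(phi, tol=1e-8):
    """Count companion eigenvalues within tol of +1 (assumed illustrative tolerance)."""
    eigenvalues = np.linalg.eigvals(companion_matrix(phi))
    return int(np.sum(np.abs(eigenvalues - 1.0) < tol))


# AR(2) with characteristic polynomial (z - 1)(z - 0.5): one unit root.
n_roots = count_unit_roots([1.5, -0.5])
```

In this simplified setting the AR coefficients can be read back from the eigenvalues (they are the coefficients of the characteristic polynomial), which is the kind of reparameterisation the abstract describes for the periodic case.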
One main challenge in implementing intelligent reflecting surface (IRS) aided communications lies in the difficulty of obtaining the channel knowledge for the base station (BS)-IRS-user cascaded links, which is needed to design high-performance IRS reflection in practice. Traditional methods for estimating IRS cascaded channels are usually based on additional pilot signals received at the BS/users, which increase the system training overhead and may not be compatible with current communication protocols. To tackle this challenge, we propose in this paper a new single-layer neural network (NN)-enabled IRS channel estimation method based only on the knowledge of users' individual received signal power measurements corresponding to different IRS random training reflections, which are easily accessible in current wireless systems. To evaluate the effectiveness of the proposed channel estimation method, we design the IRS reflection for data transmission based on the estimated cascaded channels in an IRS-aided multiuser communication system. Numerical results show that the proposed IRS channel estimation and reflection design can significantly improve the minimum received signal-to-noise ratio (SNR) among all users, as compared to existing power-measurement-based designs.
Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset consists of 120k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent Element Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel approach that evaluates visual instruction tuning without the need for human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucination when presented with our negative instructions, particularly with Existent Element Manipulation instructions. Moreover, by finetuning MiniGPT4 on LRV-Instruction, we successfully mitigate hallucination while improving performance on public datasets using less training data compared to state-of-the-art methods. Additionally, we observed that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Updates of our project are available at //fuxiaoliu.github.io/LRV/.
Self-supervised learning methods have achieved promising performance for anomalous sound detection (ASD) under domain shift, where the type of domain shift is considered in feature learning by incorporating section IDs. However, the attributes accompanying audio files under each section, such as machine operating conditions and noise types, have not been considered, although they are also crucial for characterizing domain shifts. In this paper, we present a hierarchical metadata information constrained self-supervised (HMIC) ASD method, in which the hierarchical relation between section IDs and attributes is constructed and used as a constraint to obtain finer feature representations. In addition, we propose an attribute-group-center (AGC)-based method for calculating the anomaly score under the domain shift condition. Experiments demonstrate its improved performance over the state-of-the-art self-supervised methods in the DCASE 2022 Challenge Task 2.
We consider a decentralized formulation of the active hypothesis testing (AHT) problem, where multiple agents gather noisy observations from the environment with the purpose of identifying the correct hypothesis. At each time step, agents have the option to select a sampling action. These different actions result in observations drawn from various distributions, each associated with a specific hypothesis. The agents collaborate to accomplish the task, where message exchanges between agents are allowed over a rate-limited communications channel. The objective is to devise a multi-agent policy that minimizes the Bayes risk. This risk comprises both the cost of sampling and the joint terminal cost incurred by the agents upon making a hypothesis declaration. Deriving optimal structured policies for AHT problems is generally mathematically intractable, even in the context of a single agent. As a result, recent efforts have turned to deep learning methodologies to address these problems, which have exhibited significant success in single-agent learning scenarios. In this paper, we tackle the multi-agent AHT formulation by introducing a novel algorithm rooted in the framework of deep multi-agent reinforcement learning. This algorithm, named Multi-Agent Reinforcement Learning for AHT (MARLA), operates at each time step by having each agent map its state to an action (sampling rule or stopping rule) using a trained deep neural network with the goal of minimizing the Bayes risk. We present a comprehensive set of experimental results that effectively showcase the agents' ability to learn collaborative strategies and enhance performance using MARLA. Furthermore, we demonstrate the superiority of MARLA over single-agent learning approaches. Finally, we provide an open-source implementation of the MARLA framework, for the benefit of researchers and developers in related domains.
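The two ingredients named above, a belief over hypotheses and a Bayes risk combining sampling and terminal costs, can be sketched for a single agent as follows. This is not the MARLA algorithm. The function names, the per-sample cost `sample_cost` and the unit `error_cost` are assumptions made for illustration.

```python
import numpy as np


def update_belief(belief, likelihoods):
    """One Bayes step.

    belief[h] is the prior probability of hypothesis h;
    likelihoods[h] is p(observation | h, chosen sampling action).
    """
    posterior = np.asarray(belief, dtype=float) * np.asarray(likelihoods, dtype=float)
    total = posterior.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return posterior / total


def bayes_risk(belief, n_samples, sample_cost, declared, error_cost=1.0):
    """Sampling cost so far plus the expected cost of a wrong declaration."""
    return n_samples * sample_cost + error_cost * (1.0 - belief[declared])


# Two hypotheses, uniform prior, one observation that favours hypothesis 0.
posterior = update_belief([0.5, 0.5], [0.8, 0.2])
risk = bayes_risk(posterior, n_samples=1, sample_cost=0.1, declared=0)
```

A learned policy would replace the hand-written choice of sampling action and stopping time; the belief update and the cost structure stay the same.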
Unsupervised contrastive learning methods have recently seen significant improvements, particularly through data augmentation strategies that aim to produce robust and generalizable representations. However, prevailing data augmentation methods, whether hand-designed or based on foundation models, tend to rely heavily on prior knowledge or external data. This dependence often compromises their effectiveness and efficiency. Furthermore, the applicability of most existing data augmentation strategies is limited when transitioning to other research domains, especially science-related data. This limitation stems from the paucity of prior knowledge and labeled data available in these domains. To address these challenges, we introduce DiffAug, a novel and efficient Diffusion-based data Augmentation technique. DiffAug aims to ensure that the augmented and original data share a smoothed latent space, which is achieved through diffusion steps. Unlike traditional methods, DiffAug first mines sufficient prior semantic knowledge about the neighborhood. This provides a constraint to guide the diffusion steps, eliminating the need for labels, external data/models, or prior knowledge. Designed as an architecture-agnostic framework, DiffAug provides consistent improvements. Specifically, it improves image classification and clustering accuracy by 1.6% to 4.5%. When applied to biological data, DiffAug improves performance by up to 10.1%, with an average improvement of 5.8%. DiffAug shows good performance in both vision and biological domains.
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code is made available at //github.com/facebookresearch/SlowFast
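One way to read "temporally-persistent features in the same video" is as a contrastive objective in which two clips from the same video form a positive pair and clips from other videos act as negatives. The NumPy sketch below shows such an InfoNCE-style loss. It is only one possible instantiation and is not taken from the SlowFast code base; the function name `persistence_loss` and the `temperature` value are assumptions.

```python
import numpy as np


def persistence_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss; z_a[i] and z_b[i] embed two clips of video i, shape (n, d)."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)   # unit-normalise rows
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                       # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))                # positives sit on the diagonal


rng = np.random.default_rng(0)
clips = rng.normal(size=(4, 8))
loss_aligned = persistence_loss(clips, clips)          # each clip paired with itself
loss_mismatched = persistence_loss(clips, clips[::-1])  # pairs taken from different videos
```

The loss is lower when paired clips share an embedding than when the pairing is scrambled, which is the behaviour a persistence objective rewards.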
Multi-relation Question Answering is a challenging task because it requires elaborate analysis of questions and reasoning over multiple fact triples in a knowledge base. In this paper, we present a novel model called Interpretable Reasoning Network that employs an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decides which part of an input question should be analyzed at each hop; predicts a relation that corresponds to the current parsed results; utilizes the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model can offer traceable and observable intermediate predictions for reasoning analysis and failure diagnosis.