For statistical inference on clustering, the mixture model-based framework is very popular. On the one hand, the model-based framework is convenient for producing probabilistic estimates of cluster assignments and uncertainty. On the other hand, the specification of a mixture model is fraught with the danger of misspecification that could lead to inconsistent clustering estimates. Graphical model-based clustering takes a different model specification strategy, in which the likelihood treats the data as arising dependently from a disjoint union of component graphs. To counter the large uncertainty of the graph, recent work on Bayesian spanning forest proposes using the integrated posterior of the node partition (marginalized over the latent edge distribution) to produce probabilistic estimates for clustering. Despite the strong empirical performance, it is not yet known whether the clustering estimator is consistent, especially when the data-generating mechanism is different from the specified graphical model. This article gives a positive answer in the asymptotic regime: when the data arise from an unknown mixture distribution, under mild conditions, the posterior concentrates on the ground-truth partition, producing correct clustering estimates including the number of clusters. This theoretical result is an encouraging development for the robust clustering literature, demonstrating the use of graphical models as a robust alternative to mixture models in model-based clustering.
The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8\% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2\% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2\% of the wall-clock time and text quality in 75.6\% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
The immersed interface method (IIM) for models of fluid flow and fluid-structure interaction imposes jump conditions that capture stress discontinuities generated by forces that are concentrated along immersed boundaries. Most prior work using the IIM for fluid dynamic applications has focused on smooth interfaces, but boundaries with sharp features such as corners and edges can appear in practical analyses, particularly on engineered structures. The present study builds on our work to integrate finite element-type representations of interface geometries with the IIM. Initial realizations of this approach used a continuous Galerkin (CG) finite element discretization for the boundary, but as we show herein, these approaches generate large errors near sharp geometrical features. To overcome this difficulty, this study introduces an IIM approach using discontinuous Galerkin (DG) representation of the jump conditions. Numerical examples explore the impacts of different interface representations on accuracy for both smooth and sharp boundaries, particularly flows interacting with fixed interface configurations. We demonstrate that using a DG approach provides accuracy that is comparable to the CG method for smooth cases. Further, we identify a time step size restriction for the CG representation that is directly related to the sharpness of the geometry. In contrast, time step size restrictions imposed by DG representations are demonstrated to be insensitive to the presence of sharp features.
Diffusion models (DMs) have recently shown outstanding capabilities in modeling complex image distributions, making them expressive image priors for solving Bayesian inverse problems. However, most existing DM-based methods rely on approximations in the generative process to be generic to different inverse problems, leading to inaccurate sample distributions that deviate from the target posterior defined within the Bayesian framework. To harness the generative power of DMs while avoiding such approximations, we propose a Markov chain Monte Carlo algorithm that performs posterior sampling for general inverse problems by reducing it to sampling the posterior of a Gaussian denoising problem. Crucially, we leverage a general DM formulation as a unified interface that allows for rigorously solving the denoising problem with a range of state-of-the-art DMs. We demonstrate the effectiveness of the proposed method on six inverse problems (three linear and three nonlinear), including a real-world black hole imaging problem. Experimental results indicate that our proposed method offers more accurate reconstructions and posterior estimation compared to existing DM-based imaging inverse methods.
By leveraging both texts and images, large vision language models (LVLMs) have shown significant progress in various multi-modal tasks. Nevertheless, these models often suffer from hallucinations, e.g., they exhibit inconsistencies between the visual input and the textual output. To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes. Our evaluation shows that models are prone to hallucinations on object existence, and even more so on fine-grained attributes. We further investigate whether these models rely on visual input to formulate the output texts.
We prove an existence result for the steady state flow of gas mixtures on networks. The basis of the model are the physical principles of the isothermal Euler equation, coupling conditions for the flow and pressure, and the mixing of incoming flow at nodes. The state equation is based on a convex combination of the ideal gas equations of state for natural gas and hydrogen. We analyze mathematical properties of the model allowing us to prove the existence of solutions in particular for tree-shaped networks and networks with exactly one cycle. Numerical examples illustrate the results and explore the applicability of our approach to different network topologies.
We consider the problem of estimating the number of clusters (k) in a dataset. We propose a non-parametric approach to the problem that utilizes similarity graphs to construct a robust statistic that effectively captures similarity information among observations. This graph-based statistic is applicable to datasets of any dimension, is computationally efficient to obtain, and can be paired with any kind of clustering technique. Asymptotic theory is developed to establish the selection consistency of the proposed approach. Simulation studies demonstrate that the graph-based statistic outperforms existing methods for estimating k, especially in the high-dimensional setting. We illustrate its utility on an imaging dataset and an RNA-seq dataset.
In this paper, a two-stage intelligent scheduler is proposed to minimize the packet-level delay jitter while guaranteeing delay bound. Firstly, Lyapunov technology is employed to transform the delay-violation constraint into a sequential slot-level queue stability problem. Secondly, a hierarchical scheme is proposed to solve the resource allocation between multiple base stations and users, where the multi-agent reinforcement learning (MARL) gives the user priority and the number of scheduled packets, while the underlying scheduler allocates the resource. Our proposed scheme achieves lower delay jitter and delay violation rate than the Round-Robin Earliest Deadline First algorithm and MARL with delay violation penalty.
Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
In this work, we study probability functions associated with Gaussian mixture models. Our primary focus is on extending the use of spherical radial decomposition for multivariate Gaussian random vectors to the context of Gaussian mixture models, which are not inherently spherical but only conditionally so. Specifically, the conditional probability distribution, given a random parameter of the random vector, follows a Gaussian distribution, allowing us to apply Bayesian analysis tools to the probability function. This assumption, together with spherical radial decomposition for Gaussian random vectors, enables us to represent the probability function as an integral over the Euclidean sphere. Using this representation, we establish sufficient conditions to ensure the differentiability of the probability function and provide and integral representation of its gradient. Furthermore, leveraging the Bayesian decomposition, we approximate the probability function using random sampling over the parameter space and the Euclidean sphere. Finally, we present numerical examples that illustrate the advantages of this approach over classical approximations based on random vector sampling.
Agent-based modeling and simulation has evolved as a powerful tool for modeling complex systems, offering insights into emergent behaviors and interactions among diverse agents. Integrating large language models into agent-based modeling and simulation presents a promising avenue for enhancing simulation capabilities. This paper surveys the landscape of utilizing large language models in agent-based modeling and simulation, examining their challenges and promising future directions. In this survey, since this is an interdisciplinary field, we first introduce the background of agent-based modeling and simulation and large language model-empowered agents. We then discuss the motivation for applying large language models to agent-based simulation and systematically analyze the challenges in environment perception, human alignment, action generation, and evaluation. Most importantly, we provide a comprehensive overview of the recent works of large language model-empowered agent-based modeling and simulation in multiple scenarios, which can be divided into four domains: cyber, physical, social, and hybrid, covering simulation of both real-world and virtual environments. Finally, since this area is new and quickly evolving, we discuss the open problems and promising future directions.