Learning the distribution of data on Riemannian manifolds is crucial for modeling data from non-Euclidean space, which is required by many applications in diverse scientific fields. Yet, existing generative models on manifolds suffer from expensive divergence computation or rely on approximations of heat kernel. These limitations restrict their applicability to simple geometries and hinder scalability to high dimensions. In this work, we introduce the Riemannian Diffusion Mixture, a principled framework for building a generative diffusion process on manifolds. Instead of following the denoising approach of previous diffusion models, we construct a diffusion process using a mixture of bridge processes derived on general manifolds without requiring heat kernel estimations. We develop a geometric understanding of the mixture process, deriving the drift as a weighted mean of tangent directions to the data points that guides the process toward the data distribution. We further propose a scalable training objective for learning the mixture process that readily applies to general manifolds. Our method achieves superior performance on diverse manifolds with dramatically reduced number of in-training simulation steps for general manifolds.
Tabular data is considered the last unconquered castle of deep learning, yet the task of data stream classification is stated to be an equally important and demanding research area. Due to the temporal constraints, it is assumed that deep learning methods are not the optimal solution for application in this field. However, excluding the entire -- and prevalent -- group of methods seems rather rash given the progress that has been made in recent years in its development. For this reason, the following paper is the first to present an approach to natural language data stream classification using the sentence space method, which allows for encoding text into the form of a discrete digital signal. This allows the use of convolutional deep networks dedicated to image classification to solve the task of recognizing fake news based on text data. Based on the real-life Fakeddit dataset, the proposed approach was compared with state-of-the-art algorithms for data stream classification based on generalization ability and time complexity.
Different distribution shifts require different interventions, and algorithms must be grounded in the specific shifts they address. However, methodological development for robust algorithms typically relies on structural assumptions that lack empirical validation. Advocating for an empirically grounded data-driven approach to research, we build an empirical testbed comprising natural shifts across 5 tabular datasets and 60,000 method configurations encompassing imbalanced learning and distributionally robust optimization (DRO) methods. We find $Y|X$-shifts are most prevalent on our testbed, in stark contrast to the heavy focus on $X$ (covariate)-shifts in the ML literature. The performance of robust algorithms varies significantly over shift types, and is no better than that of vanilla methods. To understand why, we conduct an in-depth empirical analysis of DRO methods and find that although often neglected by researchers, implementation details -- such as the choice of underlying model class (e.g., XGBoost) and hyperparameter selection -- have a bigger impact on performance than the ambiguity set or its radius. To further bridge that gap between methodological research and practice, we design case studies that illustrate how such a data-driven, inductive understanding of distribution shifts can enhance both data-centric and algorithmic interventions.
Probabilistic proofs of the Johnson-Lindenstrauss lemma imply that random projection can reduce the dimension of a data set and approximately preserve pairwise distances. If a distance being approximately preserved is called a success, and the complement of this event is called a failure, then such a random projection likely results in no failures. Assuming a Gaussian random projection, the lemma is proved by showing that the no-failure probability is positive using a combination of Bonferroni's inequality and Markov's inequality. This paper modifies this proof in two ways to obtain a greater lower bound on the no-failure probability. First, Bonferroni's inequality is applied to pairs of failures instead of individual failures. Second, since a pair of projection errors has a bivariate gamma distribution, the probability of a pair of successes is bounded using an inequality from Jensen (1969). If $n$ is the number of points to be embedded and $\mu$ is the probability of a success, then this leads to an increase in the lower bound on the no-failure probability of $\frac{1}{2}\binom{n}{2}(1-\mu)^2$ if $\binom{n}{2}$ is even and $\frac{1}{2}\left(\binom{n}{2}-1\right)(1-\mu)^2$ if $\binom{n}{2}$ is odd. For example, if $n=10^5$ points are to be embedded in $k=10^4$ dimensions with a tolerance of $\epsilon=0.1$, then the improvement in the lower bound is on the order of $10^{-14}$. We also show that further improvement is possible if the inequality in Jensen (1969) extends to three successes, though we do not have a proof of this result.
Forming oral models capable of understanding the complete dynamics of the oral cavity is vital across research areas such as speech correction, designing foods for the aging population, and dentistry. Magnetic resonance imaging (MRI) technologies, capable of capturing oral data essential for creating such detailed representations, offer a powerful tool for illustrating articulatory dynamics. However, its real-time application is hindered by expense and expertise requirements. Ever advancing generative AI approaches present themselves as a way to address this barrier by leveraging multi-modal approaches for generating pseudo-MRI views. Nonetheless, this immediately sparks ethical concerns regarding the utilisation of a technology with the capability to produce MRIs from facial observations. This paper explores the ethical implications of external-to-internal correlation modeling (E2ICM). E2ICM utilises facial movements to infer internal configurations and provides a cost-effective supporting technology for MRI. In this preliminary work, we employ Pix2PixGAN to generate pseudo-MRI views from external articulatory data, demonstrating the feasibility of this approach. Ethical considerations concerning privacy, consent, and potential misuse, which are fundamental to our examination of this innovative methodology, are discussed as a result of this experimentation.
Hashing functions, which are created to provide brief and erratic digests for the message entered, are the primary cryptographic primitives used in blockchain networks. Hashing is employed in blockchain networks to create linked block lists, which offer safe and secure distributed repository storage for critical information. Due to the unique nature of the hash search problem in blockchain networks, the most parallelization of calculations is possible. This technical report presents a performance evaluation of three popular hashing algorithms Blake3, SHA-256, and SHA-512. These hashing algorithms are widely used in various applications, such as digital signatures, message authentication, and password storage. It then discusses the performance metrics used to evaluate the algorithms, such as hash rate/throughput and memory usage. The evaluation is conducted on a range of hardware platforms, including desktop and VMs. The evaluation includes synthetic benchmarks. The results of the evaluation show that Blake3 generally outperforms both SHA-256 and SHA-512 in terms of throughput and latency. However, the performance advantage of Blake3 varies depending on the specific hardware platform and the size of the input data. The report concludes with recommendations for selecting the most suitable hashing algorithm for a given application, based on its performance requirements and security needs. The evaluation results can also inform future research and development efforts to improve the performance and security of hashing algorithms.
The recent success of large language models (LLMs) trained on static, pre-collected, general datasets has sparked numerous research directions and applications. One such direction addresses the non-trivial challenge of integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences. Pre-trained LLMs, when tailored for specific needs, often experience significant performance degradation in previous knowledge domains -- a phenomenon known as "catastrophic forgetting". While extensively studied in the continual learning (CL) community, it presents new manifestations in the realm of LLMs. In this survey, we provide a comprehensive overview of the current research progress on LLMs within the context of CL. This survey is structured into four main sections: we first describe an overview of continually learning LLMs, consisting of two directions of continuity: vertical continuity (or vertical continual learning), i.e., continual adaptation from general to specific capabilities, and horizontal continuity (or horizontal continual learning), i.e., continual adaptation across time and domains (Section 3). We then summarize three stages of learning LLMs in the context of modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (Section 4). Then we provide an overview of evaluation protocols for continual learning with LLMs, along with the current available data sources (Section 5). Finally, we discuss intriguing questions pertaining to continual learning for LLMs (Section 6). The full list of papers examined in this survey is available at //github.com/Wang-ML-Lab/llm-continual-learning-survey.
The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.
Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training which distributes the workload of training across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training remain preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating various optimization techniques used in distributed GNN training. First, distributed GNN training is classified into several categories according to their workflows. In addition, their computational patterns and communication patterns, as well as the optimization techniques proposed by recent work are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are also introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.
Graphs are important data representations for describing objects and their relationships, which appear in a wide diversity of real-world scenarios. As one of a critical problem in this area, graph generation considers learning the distributions of given graphs and generating more novel graphs. Owing to their wide range of applications, generative models for graphs, which have a rich history, however, are traditionally hand-crafted and only capable of modeling a few statistical properties of graphs. Recent advances in deep generative models for graph generation is an important step towards improving the fidelity of generated graphs and paves the way for new kinds of applications. This article provides an extensive overview of the literature in the field of deep generative models for graph generation. Firstly, the formal definition of deep generative models for the graph generation and the preliminary knowledge are provided. Secondly, taxonomies of deep generative models for both unconditional and conditional graph generation are proposed respectively; the existing works of each are compared and analyzed. After that, an overview of the evaluation metrics in this specific domain is provided. Finally, the applications that deep graph generation enables are summarized and five promising future research directions are highlighted.
Object detection typically assumes that training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch will lead to a significant performance drop. In this work, we aim to improve the cross-domain robustness of object detection. We tackle the domain shift on two levels: 1) the image-level shift, such as image style, illumination, etc, and 2) the instance-level shift, such as object appearance, size, etc. We build our approach based on the recent state-of-the-art Faster R-CNN model, and design two domain adaptation components, on image level and instance level, to reduce the domain discrepancy. The two domain adaptation components are based on H-divergence theory, and are implemented by learning a domain classifier in adversarial training manner. The domain classifiers on different levels are further reinforced with a consistency regularization to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We evaluate our newly proposed approach using multiple datasets including Cityscapes, KITTI, SIM10K, etc. The results demonstrate the effectiveness of our proposed approach for robust object detection in various domain shift scenarios.