The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. To tackle this heterogeneity, we employ a simulator to estimate the performance of the matrix-matrix multiplication (GEMM) kernel on processors designed to operate at the edge. Our simulator adheres to the modern implementations of GEMM advocated by GotoBLAS2, BLIS, OpenBLAS, etc., to carefully account for the amount of data transferred across the memory hierarchy by different algorithmic variants of the kernel. A small collection of experiments provides the data necessary to calibrate the simulator and deliver highly accurate estimations of the execution time for a given processor architecture.
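To make the algorithmic variants concrete: GotoBLAS2/BLIS-style GEMM organizes the computation as a nest of loops around a small micro-kernel, with blocks of the operands packed so that each resides in a particular cache level. The following minimal Python sketch illustrates only that loop structure; the block sizes are illustrative placeholders, and this is not the simulator described above.
\begin{verbatim}
import numpy as np

# Illustrative cache-blocking parameters (placeholders, not tuned values).
MC, KC, NC = 96, 256, 1024

def blocked_gemm(A, B, C):
    """GotoBLAS2/BLIS-style loop nest computing C += A @ B."""
    m, k = A.shape
    _, n = B.shape
    for jc in range(0, n, NC):              # partition columns of B (~L3 cache)
        for pc in range(0, k, KC):          # partition the k dimension
            Bp = B[pc:pc+KC, jc:jc+NC]      # "packed" KC x NC panel of B
            for ic in range(0, m, MC):      # partition rows of A (~L2 cache)
                Ap = A[ic:ic+MC, pc:pc+KC]  # "packed" MC x KC block of A
                # Stand-in for the MR x NR register-tiled micro-kernel:
                C[ic:ic+MC, jc:jc+NC] += Ap @ Bp
    return C
\end{verbatim}
Counting how many times each packed block is loaded at each level of this nest is exactly the kind of data-transfer bookkeeping such a simulator must perform.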
In recent years, the accumulation of data across various institutions has drawn attention to confidential data analysis, a technology that improves analytical accuracy by sharing data among multiple institutions while protecting sensitive information. Among these methods, Data Collaboration Analysis (DCA) is noted for its efficiency in terms of computational cost and communication load, facilitating data sharing and analysis across institutions while safeguarding confidential information. However, the existing optimization problems for determining the necessary collaborative functions face challenges: the optimal solution for the collaborative representation is often a zero matrix, and the process by which solutions are derived is difficult to interpret. This research addresses these issues by formulating the optimization problem through the segmentation of matrices into column vectors and proposing a solution method based on the generalized eigenvalue problem. Additionally, we demonstrate how to construct collaborative functions more effectively through weighting and the selection of algorithms suited to specific situations. Experiments on real-world datasets show that our proposed formulation and solution for the collaborative-function optimization problem achieve superior predictive accuracy compared to existing methods.
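As a sketch of the core computational step: a column-vector formulation of this kind typically reduces to a generalized eigenvalue problem $Av = \lambda Bv$. The snippet below solves a generic instance with SciPy; the matrices $A$ and $B$ are hypothetical stand-ins for the quadratic forms of the actual objective, which is not reproduced here.
\begin{verbatim}
import numpy as np
from scipy.linalg import eigh

# Hypothetical symmetric matrices standing in for the quadratic forms
# that arise once the objective is split into column vectors.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T                        # symmetric positive semi-definite
B = np.eye(50) + 0.1 * (M.T @ M)   # symmetric positive definite

# Solve A v = lambda B v. eigh returns eigenvalues in ascending order,
# so the trailing eigenvectors maximize the Rayleigh quotient v'Av / v'Bv.
eigvals, eigvecs = eigh(A, B)
basis = eigvecs[:, -5:]            # e.g., a 5-dimensional collaborative basis
\end{verbatim}
Because the eigenvectors are normalized rather than free, the degenerate all-zero solution is excluded by construction, which is the point of recasting the problem in this form.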
Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing on the embedding space. In this paper, we conduct systematic studies of the optimization challenges posed by both the embedding space and the denoising model, which have not been carefully explored before. First, because the embeddings are themselves learnable, the data distribution shifts during training, which may lead to the collapse of the embedding space and unstable training. To alleviate this problem, we propose a new objective, the anchor loss, which is more efficient than previous methods. Second, we find that the noise levels of conventional schedules are insufficient for training a desirable denoising model and introduce varying degrees of degeneration as a consequence. To address this challenge, we propose a novel framework called noise rescaling. Based on the above analysis, we propose Difformer, an embedding diffusion model based on the Transformer. Experiments on a variety of seminal text generation tasks show the effectiveness of the proposed methods and the superiority of Difformer over previous state-of-the-art embedding diffusion baselines.
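As a rough illustration of the failure mode and the kind of remedy involved: when the embedding matrix is trained jointly with the denoiser, nothing prevents all embeddings from shrinking toward a single point, trivially minimizing the reconstruction term. The sketch below shows one plausible anchor-style regularizer that ties the embeddings to the denoiser's prediction; it is a hypothetical rendering, not the exact objective of the paper.
\begin{verbatim}
import torch
import torch.nn.functional as F

def embedding_diffusion_step(model, emb, anchor_weight=1.0, T=1000):
    """One training step on learned embeddings z_0 = emb (batch, seq, dim).

    Generic sketch: 'model' is any denoiser f(z_t, t) -> z_0 prediction;
    the toy cosine schedule and the anchor term are illustrative only.
    """
    t = torch.randint(0, T, (emb.size(0),), device=emb.device)
    alpha_bar = torch.cos(t.float() / T * torch.pi / 2).pow(2)
    a = alpha_bar.sqrt().view(-1, 1, 1)
    s = (1.0 - alpha_bar).sqrt().view(-1, 1, 1)
    z_t = a * emb + s * torch.randn_like(emb)    # forward process q(z_t | z_0)
    z0_pred = model(z_t, t)
    denoise = F.mse_loss(z0_pred, emb.detach())  # fit the denoiser
    anchor = F.mse_loss(emb, z0_pred.detach())   # keep embeddings from collapsing
    return denoise + anchor_weight * anchor
\end{verbatim}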
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework exploits the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks -- (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors) -- indicate that our JMT fusion provides a cost-effective solution for MMER. Empirical results show that MMER systems with our proposed fusion outperform relevant baselines and state-of-the-art methods.
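To illustrate the flavor of key-based cross-attention fusion (a minimal two-modality sketch with hypothetical names and shapes, not the paper's exact JMT architecture): each modality's queries attend over the other modality's keys and values, so the fused representation encodes inter-modal dependencies explicitly.
\begin{verbatim}
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Two-stream cross-attention fusion over per-modality embeddings."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)   # e.g., a valence or pain regressor

    def forward(self, vis, aud):
        # vis, aud: (batch, time, dim) outputs of the modality backbones
        v, _ = self.a2v(query=vis, key=aud, value=aud)  # audio informs visual
        a, _ = self.v2a(query=aud, key=vis, value=vis)  # visual informs audio
        fused = torch.cat([v.mean(dim=1), a.mean(dim=1)], dim=-1)
        return self.head(fused)
\end{verbatim}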
With the rapid uptick in applications of Bayesian external data borrowing, eliciting a prior distribution with the proper amount of information becomes increasingly critical. The prior effective sample size (ESS) is an intuitive and efficient measure for this purpose. The majority of ESS definitions have been proposed in the context of borrowing control information. While many Bayesian models naturally extend to leveraging external information on the treatment effect scale, very little attention has been directed to computing the prior ESS in this setting. In this research, we bridge this methodological gap by extending the popular ELIR ESS definition. We lay out the general framework and derive the prior ESS for various types of endpoints and treatment effect measures. The posterior distribution and the predictive consistency property of the ESS are also examined. The methods are implemented in R programs available on GitHub: https://github.com/squallteo/TrtEffESS.
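For background, the ELIR (expected local information ratio) definition being extended here can be stated as (our rendering, following the ELIR literature):
\[
\mathrm{ESS}_{\mathrm{ELIR}} \;=\; \mathbb{E}_{p}\!\left[\frac{i(p,\theta)}{i_F(\theta)}\right],
\qquad
i(p,\theta) \;=\; -\frac{d^{2}}{d\theta^{2}}\log p(\theta),
\]
where $i(p,\theta)$ is the information in the prior $p$ and $i_F(\theta)$ is the Fisher information contributed by a single observation. In the setting above, the parameter $\theta$ is taken to be a treatment effect measure rather than a control-arm parameter.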
We investigate the computational power of particle methods, a well-established class of algorithms with applications in scientific computing and computer simulation. The computational power of a model of computation determines the class of problems it can solve. Automata theory describes the computational power of abstract machines (automata) and the problems they can solve. At the top of the Chomsky hierarchy of formal languages and grammars are Turing machines, which embody the concept on which most modern computers are built. Although particle methods can be interpreted as automata based on their formal definition, their computational power has so far not been studied. We address this by analyzing the Turing completeness of particle methods. In particular, we prove two sets of restrictions under which a particle method remains Turing powerful, and we show when it loses Turing powerfulness. This contributes to understanding the theoretical foundations of particle methods and provides insight into the power of computer simulations.
Observable operator models (OOMs) offer a powerful framework for modelling stochastic processes, surpassing traditional hidden Markov models (HMMs) in generality and efficiency. However, using OOMs to model infinite-dimensional processes poses significant theoretical challenges. This article develops a rigorous approach to an approximation theory for OOMs of infinite-dimensional processes. Building upon foundational work outlined in an unpublished tutorial [Jae98], an inner product structure on the space of future distributions is rigorously established, and the continuity of observable operators with respect to the associated 2-norm is proven. The original theorem proven in this work describes a fundamental obstacle to making an infinite-dimensional space of future distributions into a Hilbert space. The presented findings lay the groundwork for future research on approximating observable operators of infinite-dimensional processes, and a remedy to the encountered obstacle is suggested.
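For orientation, the standard finite-dimensional OOM setup (following Jaeger; our notation) assigns to a process over an alphabet a linear observable operator $\tau_a$ per symbol $a$, an initial state $\omega_0$, and a linear functional $\sigma$, with sequence probabilities
\[
P(a_1 a_2 \cdots a_k) \;=\; \sigma\,\tau_{a_k}\cdots\tau_{a_2}\tau_{a_1}\,\omega_0 .
\]
The infinite-dimensional setting replaces the finite state space with a space of future distributions, which is exactly where the inner product structure and the Hilbert-space obstacle discussed above arise.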
The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.
Keeping pace with developments in the research field of artificial intelligence, knowledge graphs (KGs) have attracted a surge of interest from both academia and industry. As a representation of semantic relations between entities, KGs have proven to be particularly relevant for natural language processing (NLP), experiencing rapid spread and wide adoption in recent years. Given the increasing amount of research work in this area, several KG-related approaches have been surveyed in the NLP research community. However, a comprehensive study that categorizes established topics and reviews the maturity of individual research streams remains absent to this day. Contributing to closing this gap, we systematically analyzed 507 papers from the literature on KGs in NLP. Our survey encompasses a multifaceted review of tasks, research types, and contributions. As a result, we present a structured overview of the research landscape, provide a taxonomy of tasks, summarize our findings, and highlight directions for future work.
Sampling methods (e.g., node-wise, layer-wise, or subgraph sampling) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamics of optimization, which leads to high variance in estimating the stochastic gradients. The high-variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, and that mitigating both types of variance is necessary to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and achieves better generalization than existing methods.
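Schematically (our notation, not the paper's exact bound), the composite structure in question is
\[
\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} f_i\bigl(g_i(\theta)\bigr),
\]
where $g_i$ aggregates neighborhood embeddings in the forward pass and $f_i$ is the per-node loss. A sampled gradient $\widehat{\nabla}\mathcal{L}$ then accumulates error from two sources,
\[
\mathbb{E}\bigl\|\widehat{\nabla}\mathcal{L}-\nabla\mathcal{L}\bigr\|^{2}
\;\lesssim\;
\underbrace{\mathbb{E}\bigl\|\widehat{g}_i-g_i\bigr\|^{2}}_{\text{embedding approximation (forward)}}
\;+\;
\underbrace{\mathbb{E}\bigl\|\widehat{\nabla} f-\nabla f\bigr\|^{2}}_{\text{stochastic gradient (backward)}},
\]
so reducing either term alone leaves the other as a bottleneck, which motivates the decoupled strategy.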
Object detection typically assumes that the training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch will lead to a significant performance drop. In this work, we aim to improve the cross-domain robustness of object detection. We tackle the domain shift on two levels: 1) the image-level shift, such as image style, illumination, etc., and 2) the instance-level shift, such as object appearance, size, etc. We build our approach on the recent state-of-the-art Faster R-CNN model and design two domain adaptation components, on the image level and the instance level, to reduce the domain discrepancy. The two domain adaptation components are based on H-divergence theory and are implemented by learning a domain classifier in an adversarial training manner. The domain classifiers on the different levels are further reinforced with a consistency regularization to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We evaluate our newly proposed approach on multiple datasets, including Cityscapes, KITTI, and SIM10K. The results demonstrate the effectiveness of our approach for robust object detection in various domain shift scenarios.
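To sketch what an adversarially trained domain classifier typically looks like in practice (a common realization via a gradient reversal layer, as in DANN; hypothetical names and feature dimensions, not the paper's exact heads):
\begin{verbatim}
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainClassifier(nn.Module):
    """Predicts source vs. target domain from image- or instance-level features.

    Training it through GradReverse pushes the feature extractor toward
    domain-invariant representations, a surrogate for minimizing H-divergence.
    """
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, feat, lam=1.0):
        return self.net(GradReverse.apply(feat, lam))  # logits for BCE loss
\end{verbatim}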