Cyber-Physical Systems (CPSs), e.g., elevator systems and autonomous driving systems, are progressively permeating our everyday lives. To ensure their safety, various analyses need to be conducted, such as anomaly detection and time-to-event analysis (the focus of this paper). Recently, it has been widely accepted that digital Twins (DTs) can serve as an efficient method to aid in the development, maintenance, and safe and secure operation of CPSs. However, CPSs frequently evolve, e.g., with new or updated functionalities, which demand their corresponding DTs be co-evolved, i.e., in synchronization with the CPSs. To that end, we propose a novel method, named PPT, utilizing an uncertainty-aware transfer learning for DT evolution. Specifically, we first pretrain PPT with a pretraining dataset to acquire generic knowledge about the CPSs, followed by adapting it to a specific CPS with the help of prompt tuning. Results highlight that PPT is effective in time-to-event analysis in both elevator and ADSs case studies, on average, outperforming a baseline method by 7.31 and 12.58 in terms of Huber loss, respectively. The experiment results also affirm the effectiveness of transfer learning, prompt tuning and uncertainty quantification in terms of reducing Huber loss by at least 21.32, 3.14 and 4.08, respectively, in both case studies.
This paper introduces Robust Spin (R-Spin), a data-efficient self-supervised fine-tuning framework for speaker and noise-invariant speech representations by learning discrete acoustic units with speaker-invariant clustering (Spin). R-Spin resolves Spin's issues and enhances content representations by learning to predict acoustic pieces. R-Spin offers a 12X reduction in computational resources compared to previous state-of-the-art methods while outperforming them in severely distorted speech scenarios. This paper provides detailed analyses to show how discrete units contribute to speech encoder training and improving robustness in diverse acoustic environments.
Designing a safe, trusted, and ethical AI may be practically impossible; however, designing AI with safe, trusted, and ethical use in mind is possible and necessary in safety and mission-critical domains like aerospace. Safe, trusted, and ethical use of AI are often used interchangeably; however, a system can be safely used but not trusted or ethical, have a trusted use that is not safe or ethical, and have an ethical use that is not safe or trusted. This manuscript serves as a primer to illuminate the nuanced differences between these concepts, with a specific focus on applications of Human-AI teaming in aerospace system control, where humans may be in, on, or out-of-the-loop of decision-making.
Large Language Models (LLMs) have demonstrated remarkable performance on coding related tasks, particularly on assisting humans in programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capacities of LLMs suffer from severe limitations. First, most benchmarks are deficient as they focus on a narrow range of popular programming languages and specific tasks, whereas the real-world software development scenarios show dire need to implement systems with multilingual programming environments to satisfy diverse requirements. Practical programming practices also strongly expect multi-task settings for testing coding capabilities of LLMs comprehensively and robustly. Second, most benchmarks also fail to consider the actual executability and the consistency of execution results of the generated code. To bridge these gaps between existing benchmarks and expectations from practical applications, we introduce CodeScope, an execution-based, multilingual, multi-task, multi-dimensional evaluation benchmark for comprehensively gauging LLM capabilities on coding tasks. CodeScope covers 43 programming languages and 8 coding tasks. It evaluates the coding performance of LLMs from three dimensions (perspectives): difficulty, efficiency, and length. To facilitate execution-based evaluations of code generation, we develop MultiCodeEngine, an automated code execution engine that supports 14 programming languages. Finally, we systematically evaluate and analyze 8 mainstream LLMs on CodeScope tasks and demonstrate the superior breadth and challenges of CodeScope for evaluating LLMs on code understanding and generation tasks compared to other benchmarks. The CodeScope benchmark and datasets are publicly available at //github.com/WeixiangYAN/CodeScope.
Randomized experiments are a powerful methodology for data-driven evaluation of decisions or interventions. Yet, their validity may be undermined by network interference. This occurs when the treatment of one unit impacts not only its outcome but also that of connected units, biasing traditional treatment effect estimations. Our study introduces a new framework to accommodate complex and unknown network interference, moving beyond specialized models in the existing literature. Our framework, which we term causal message-passing, is grounded in a high-dimensional approximate message passing methodology and is specifically tailored to experimental design settings with prevalent network interference. Utilizing causal message-passing, we present a practical algorithm for estimating the total treatment effect and demonstrate its efficacy in four numerical scenarios, each with its unique interference structure.
Despite Multi-modal Large Language Models (MM-LLMs) have made exciting strides recently, they are still struggling to efficiently model the interactions among multi-modal inputs and the generation in non-textual modalities. In this work, we propose TEAL (Tokenize and Embed ALl)}, an approach to treat the input from any modality as a token sequence and learn a joint embedding space for all modalities. Specifically, for the input from any modality, TEAL first discretizes it into a token sequence with the off-the-shelf tokenizer and embeds the token sequence into a joint embedding space with a learnable embedding matrix. MM-LLMs just need to predict the multi-modal tokens autoregressively as the textual LLMs do. Finally, the corresponding de-tokenizer is applied to generate the output in each modality based on the predicted token sequence. With the joint embedding space, TEAL enables the frozen LLMs to perform both understanding and generation tasks involving non-textual modalities, such as image and audio. Thus, the textual LLM can just work as an interface and maintain its high performance in textual understanding and generation. Experiments show that TEAL achieves substantial improvements in multi-modal understanding, and implements a simple scheme for multi-modal generations.
SoCs are now designed with their own AI accelerator segment to accommodate the ever-increasing demand of Deep Learning (DL) applications. With powerful MAC engines for matrix multiplications, these accelerators show high computing performance. However, because of limited memory resources (i.e., bandwidth and capacity), they fail to achieve optimum system performance during large batch training and inference. In this work, we propose a memory system with high on-chip capacity and bandwidth to shift the gear of AI accelerators from memory-bound to achieving system-level peak performance. We develop the memory system with DTCO-enabled customized SOT-MRAM as large on-chip memory through STCO and detailed characterization of the DL workloads. %We evaluate our workload-aware memory system on the CV and NLP benchmarks and observe significant PPA improvement compared to an SRAM-based in both inference and training modes. Our workload-aware memory system achieves 8X energy and 9X latency improvement on Computer Vision (CV) benchmarks in training and 8X energy and 4.5X latency improvement on Natural Language Processing (NLP) benchmarks in training while consuming only around 50% of SRAM area at iso-capacity.
With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.
We propose VQ-NeRF, a two-branch neural network model that incorporates Vector Quantization (VQ) to decompose and edit reflectance fields in 3D scenes. Conventional neural reflectance fields use only continuous representations to model 3D scenes, despite the fact that objects are typically composed of discrete materials in reality. This lack of discretization can result in noisy material decomposition and complicated material editing. To address these limitations, our model consists of a continuous branch and a discrete branch. The continuous branch follows the conventional pipeline to predict decomposed materials, while the discrete branch uses the VQ mechanism to quantize continuous materials into individual ones. By discretizing the materials, our model can reduce noise in the decomposition process and generate a segmentation map of discrete materials. Specific materials can be easily selected for further editing by clicking on the corresponding area of the segmentation outcomes. Additionally, we propose a dropout-based VQ codeword ranking strategy to predict the number of materials in a scene, which reduces redundancy in the material segmentation process. To improve usability, we also develop an interactive interface to further assist material editing. We evaluate our model on both computer-generated and real-world scenes, demonstrating its superior performance. To the best of our knowledge, our model is the first to enable discrete material editing in 3D scenes.
The Internet of Things (IoT), the network integrating billions of smart physical devices embedded with sensors, software, and communication technologies for the purpose of connecting and exchanging data with other devices and systems, is a critical and rapidly expanding component of our modern world. The IoT ecosystem provides a rich source of real-world modalities such as motion, thermal, geolocation, imaging, depth, sensors, video, and audio for prediction tasks involving the pose, gaze, activities, and gestures of humans as well as the touch, contact, pose, 3D of physical objects. Machine learning presents a rich opportunity to automatically process IoT data at scale, enabling efficient inference for impact in understanding human wellbeing, controlling physical devices, and interconnecting smart cities. To develop machine learning technologies for IoT, this paper proposes MultiIoT, the most expansive IoT benchmark to date, encompassing over 1.15 million samples from 12 modalities and 8 tasks. MultiIoT introduces unique challenges involving (1) learning from many sensory modalities, (2) fine-grained interactions across long temporal ranges, and (3) extreme heterogeneity due to unique structure and noise topologies in real-world sensors. We also release a set of strong modeling baselines, spanning modality and task-specific methods to multisensory and multitask models to encourage future research in multisensory representation learning for IoT.
Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question and answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the computational models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We have conducted a user study to validate the quality of explanations synthesized by our method. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.