In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. Experiments in monolingual (it-en) and multilingual (\{de,es,it\}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation or even improves output quality compared to separate ASR and ST models, yielding an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.
Data produced by on-chip sensors in modern SoCs contains a large amount of information such as occurring faults, aging status, accumulated radiation dose, performance characteristics, environmental and other operational parameters. Such information provides insight into the overall health of a system's hardware as well as the operability of individual modules. This gives a chance to mitigate faults and avoid using faulty units, thus enabling hardware health management. Raw data from embedded sensors cannot be immediately used to perform health management tasks. In most cases, the information about occurred faults needs to be analyzed taking into account the history of the previously reported fault events and other collected statistics. For this purpose, we propose a special structure called Health Map (HM) that holds the information about functional resources, occurring faults and maps relationships between these. In addition, we propose algorithms for aggregation and classification of data received from on-chip sensors. The proposed Health Map contains detailed information on a particular system level (e.g., module, SoC, board) that can be compiled into a summary of hardware health status that in its turn enables distributed hierarchical health management by using this information at a higher level of system hierarchy, thus increasing the system's availability and effective lifetime.
As the rapidly evolving field of machine learning continues to produce incredibly useful tools and models, the potential for quantum computing to provide speed up for machine learning algorithms is becoming increasingly desirable. In particular, quantum circuits in place of classical convolutional filters for image detection-based tasks are being investigated for the ability to exploit quantum advantage. However, these attempts, referred to as quantum convolutional neural networks (QCNNs), lack the ability to efficiently process data with multiple channels and therefore are limited to relatively simple inputs. In this work, we present a variety of hardware-adaptable quantum circuit ansatzes for use as convolutional kernels, and demonstrate that the quantum neural networks we report outperform existing QCNNs on classification tasks involving multi-channel data. We envision that the ability of these implementations to effectively learn inter-channel information will allow quantum machine learning methods to operate with more complex data. This work is available as open source at //github.com/anthonysmaldone/QCNN-Multi-Channel-Supervised-Learning.
In the contemporary landscape of social media, an alarming number of users express negative emotions, some of which manifest as strong suicidal intentions. This situation underscores a profound need for trained psychological counselors who can enact effective mental interventions. However, the development of these professionals is often an imperative but time-consuming task. Consequently, the mobilization of non-professionals or volunteers in this capacity emerges as a pressing concern. Leveraging the capabilities of artificial intelligence, and in particular, the recent advances in large language models, offers a viable solution to this challenge. This paper introduces a novel model constructed on the foundation of large language models to fully assist non-professionals in providing psychological interventions on online user discourses. This framework makes it plausible to harness the power of non-professional counselors in a meaningful way. A comprehensive study was conducted involving ten professional psychological counselors of varying expertise, evaluating the system across five critical dimensions. The findings affirm that our system is capable of analyzing patients' issues with relative accuracy and proffering professional-level strategies recommendations, thereby enhancing support for non-professionals. This research serves as a compelling validation of the application of large language models in the field of psychology and lays the groundwork for a new paradigm of community-based mental health support.
As deep learning predictive models become an integral part of a large spectrum of precision agricultural systems, a barrier to the adoption of such automated solutions is the lack of user trust in these highly complex, opaque and uncertain models. Indeed, deep neural networks are not equipped with any explicit guarantees that can be used to certify the system's performance, especially in highly varying uncontrolled environments such as the ones typically faced in computer vision for agriculture.Fortunately, certain methods developed in other communities can prove to be important for agricultural applications. This article presents the conformal prediction framework that provides valid statistical guarantees on the predictive performance of any black box prediction machine, with almost no assumptions, applied to the problem of deep visual classification of weeds and crops in real-world conditions. The framework is exposed with a focus on its practical aspects and special attention accorded to the Adaptive Prediction Sets (APS) approach that delivers marginal guarantees on the model's coverage. Marginal results are then shown to be insufficient to guarantee performance on all groups of individuals in the population as characterized by their environmental and pedo-climatic auxiliary data gathered during image acquisition.To tackle this shortcoming, group-conditional conformal approaches are presented: the ''classical'' method that consists of iteratively applying the APS procedure on all groups, and a proposed elegant reformulation and implementation of the procedure using quantile regression on group membership indicators. Empirical results showing the validity of the proposed approach are presented and compared to the marginal APS then discussed.
Research software plays a crucial role in advancing scientific knowledge, but ensuring its sustainability, maintainability, and long-term viability is an ongoing challenge. The Sustainable Research Software Institute (SRSI) Model has been designed to address the concerns, and presents a comprehensive framework designed to promote sustainable practices in the research software community. However the SRSI Model does not address the transitional requirements for the Exascale Computing Project (ECP) Software Technology (ECP-ST) focus area specifically. This white paper provides an overview and detailed description of how ECP-ST will transition into the SRSI in a compressed time frame that a) meets the needs of the ECP end-of-technical-activities deadline; and b) ensures the continuity of the sustainability efforts that are already underway.
In the era where personal devices and applications are pervasive, individuals are continuously generating and interacting with a vast amount of data. Despite this, access to and control over such data remains challenging due to its scattering across various app providers and formats. This paper presents Human-Scale Computing, a vision and an approach where every individual has straightforward, unified access to their data across all devices, apps, and services. Key to this solution is the Human Scale Portal, a progressively designed intermediary that integrates different applications and service providers. This design adopts a transitional development and deployment strategy, involving an initial bootstrapping phase to engage application providers, an acceleration phase to enhance the convenience of access, and an eventual solution. We believe that this progressive "narrow waist" design can bridge the gap between the current state of data access and our envisioned future of human-scale access.
Motion planning and control are crucial components of robotics applications like automated driving. Here, spatio-temporal hard constraints like system dynamics and safety boundaries (e.g., obstacles) restrict the robot's motions. Direct methods from optimal control solve a constrained optimization problem. However, in many applications finding a proper cost function is inherently difficult because of the weighting of partially conflicting objectives. On the other hand, Imitation Learning (IL) methods such as Behavior Cloning (BC) provide an intuitive framework for learning decision-making from offline demonstrations and constitute a promising avenue for planning and control in complex robot applications. Prior work primarily relied on soft constraint approaches, which use additional auxiliary loss terms describing the constraints. However, catastrophic safety-critical failures might occur in out-of-distribution (OOD) scenarios. This work integrates the flexibility of IL with hard constraint handling in optimal control. Our approach constitutes a general framework for constraint robotic motion planning and control, as well as traffic agent simulation, whereas we focus on mobile robot and automated driving applications. Hard constraints are integrated into the learning problem in a differentiable manner, via explicit completion and gradient-based correction. Simulated experiments of mobile robot navigation and automated driving provide evidence for the performance of the proposed method.
This paper presents a novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, based on the CLIP image encoder, a temporal model consisting of several Transformer encoders is introduced for extracting temporal facial expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour that is related to the classes (facial expressions) that we are interested in recognising -- those descriptions are generated using large language models, like ChatGPT. This, in contrast to works that use only the class names and more accurately captures the relationship between them. Alongside the textual description, we introduce a learnable token which helps the model learn relevant context information for each expression during training. Extensive experiments demonstrate the effectiveness of the proposed method and show that our DFER-CLIP also achieves state-of-the-art results compared with the current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks. Code is publicly available at //github.com/zengqunzhao/DFER-CLIP.
Anomaly detection (AD) tasks have been solved using machine learning algorithms in various domains and applications. The great majority of these algorithms use normal data to train a residual-based model, and assign anomaly scores to unseen samples based on their dissimilarity with the learned normal regime. The underlying assumption of these approaches is that anomaly-free data is available for training. This is, however, often not the case in real-world operational settings, where the training data may be contaminated with a certain fraction of abnormal samples. Training with contaminated data, in turn, inevitably leads to a deteriorated AD performance of the residual-based algorithms. In this paper we introduce a framework for a fully unsupervised refinement of contaminated training data for AD tasks. The framework is generic and can be applied to any residual-based machine learning model. We demonstrate the application of the framework to two public datasets of multivariate time series machine data from different application fields. We show its clear superiority over the naive approach of training with contaminated data without refinement. Moreover, we compare it to the ideal, unrealistic reference in which anomaly-free data would be available for training. Since the approach exploits information from the anomalies, and not only from the normal regime, it is comparable and often outperforms the ideal baseline as well.
To solve the information explosion problem and enhance user experience in various online applications, recommender systems have been developed to model users preferences. Although numerous efforts have been made toward more personalized recommendations, recommender systems still suffer from several challenges, such as data sparsity and cold start. In recent years, generating recommendations with the knowledge graph as side information has attracted considerable interest. Such an approach can not only alleviate the abovementioned issues for a more accurate recommendation, but also provide explanations for recommended items. In this paper, we conduct a systematical survey of knowledge graph-based recommender systems. We collect recently published papers in this field and summarize them from two perspectives. On the one hand, we investigate the proposed algorithms by focusing on how the papers utilize the knowledge graph for accurate and explainable recommendation. On the other hand, we introduce datasets used in these works. Finally, we propose several potential research directions in this field.