This work proposes an interactive art installation "Mood spRing" designed to reflect the mood of the environment through interpretation of language and tone. Mood spRing consists of an AI program that controls an immersive 3D animation of the seasons. If the AI program perceives the language and tone of the users as pleasant, the animation progresses through idealized renditions of seasons. Otherwise, it slips into unpleasant weather and natural disasters of the season. To interpret the language and tone of the user interaction, hybrid state-of-the-art emotion detection methods are applied to the user audio and text inputs. The emotional states detected separately from tone and language are fused by a novel approach that aims at minimizing the possible model disparity across diverse demographic groups.
The recent success of large language models (LLMs) has shown great potential to develop more powerful conversational recommender systems (CRSs), which rely on natural language conversations to satisfy user needs. In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol. It might over-emphasize the matching with the ground-truth items or utterances generated by human annotators, while neglecting the interactive nature of being a capable CRS. To overcome the limitation, we further propose an interactive Evaluation approach based on LLMs named iEvaLM that harnesses LLM-based user simulators. Our evaluation approach can simulate various interaction scenarios between users and systems. Through the experiments on two publicly available CRS datasets, we demonstrate notable improvements compared to the prevailing evaluation protocol. Furthermore, we emphasize the evaluation of explainability, and ChatGPT showcases persuasive explanation generation for its recommendations. Our study contributes to a deeper comprehension of the untapped potential of LLMs for CRSs and provides a more flexible and easy-to-use evaluation framework for future research endeavors. The codes and data are publicly available at //github.com/RUCAIBox/iEvaLM-CRS.
With the prosperity of the video surveillance, multiple visual sensors have been applied for an accurate localization of pedestrians in a specific area, which facilitate various applications like intelligent safety or new retailing. However, previous methods rely on the supervision from the human annotated pedestrian positions in every video frame and camera view, which is a heavy burden in addition to the necessary camera calibration and synchronization. Therefore, we propose in this paper an Unsupervised Multi-view Pedestrian Detection approach (UMPD) to eliminate the need of annotations to learn a multi-view pedestrian detector. 1) Firstly, Semantic-aware Iterative Segmentation (SIS) is proposed to extract discriminative visual representations of the input images from different camera views via an unsupervised pretrained model, then convert them into 2D segments of pedestrians, based on our proposed iterative Principal Component Analysis and the zero-shot semantic classes from the vision-language pretrained models. 2) Secondly, we propose Vertical-aware Differential Rendering (VDR) to not only learn the densities and colors of 3D voxels by the masks of SIS, images and camera poses, but also constraint the voxels to be vertical towards the ground plane, following the physical characteristics of pedestrians. 3) Thirdly, the densities of 3D voxels learned by VDR are projected onto Bird-Eyes-View as the final detection results. Extensive experiments on popular multi-view pedestrian detection benchmarks, i.e., Wildtrack and MultiviewX, show that our proposed UMPD approach, as the first unsupervised method to our best knowledge, performs competitively with the previous state-of-the-art supervised techniques. Code will be available.
We introduce ViDaS, a two-stream, fully convolutional Video, Depth-Aware Saliency network to address the problem of attention modeling ``in-the-wild", via saliency prediction in videos. Contrary to existing visual saliency approaches using only RGB frames as input, our network employs also depth as an additional modality. The network consists of two visual streams, one for the RGB frames, and one for the depth frames. Both streams follow an encoder-decoder approach and are fused to obtain a final saliency map. The network is trained end-to-end and is evaluated in a variety of different databases with eye-tracking data, containing a wide range of video content. Although the publicly available datasets do not contain depth, we estimate it using three different state-of-the-art methods, to enable comparisons and a deeper insight. Our method outperforms in most cases state-of-the-art models and our RGB-only variant, which indicates that depth can be beneficial to accurately estimating saliency in videos displayed on a 2D screen. Depth has been widely used to assist salient object detection problems, where it has been proven to be very beneficial. Our problem though differs significantly from salient object detection, since it is not restricted to specific salient objects, but predicts human attention in a more general aspect. These two problems not only have different objectives, but also different ground truth data and evaluation metrics. To our best knowledge, this is the first competitive deep learning video saliency estimation approach that combines both RGB and Depth features to address the general problem of saliency estimation ``in-the-wild". The code will be publicly released.
The adoption of modern technologies for use in healthcare has become an inevitable change. The emergence of artificial intelligence drives this digital disruption. Artificial intelligence has augmented machine capabilities to act like and interact with human beings. As the healthcare industry adopts technology in most areas, an area in healthcare that is touched by this change is clinical practice. New technologies are being designed to improve healthcare services. One aspect of these technologies is voice user interfaces. This paper reviews applications of voice user interfaces in clinical settings. Several information sources were consulted, and based on eligibility criteria, a search was conducted, and ten papers selected. This study presents findings from the last ten years (2009-2019). The results are categorized based on findings, also they contribute to the discussion and the research gaps identified for future study as regards context-aware voice user interfaces and the appearance of conversational agents from a given set of options.
Different variants of a Forensic Automatic Speaker Recognition (FASR) system based on Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN) are tested under conditions reflecting those of a real forensic voice comparison case, according to the forensic_eval_01 evaluation campaign settings. Using this recent neural model as an embedding extraction block, various normalization strategies at the level of embeddings and scores allow us to observe the variations in system performance, in terms of discriminating power, accuracy and precision metrics. From the achieved results it is possible to state that ECAPA-TDNN can be very successfully used as a base component of a FASR system, managing to surpass the previous state of the art, at least in the context of the considered operating conditions.
Human intelligence's adaptability is remarkable, allowing us to adjust to new tasks and multi-modal environments swiftly. This skill is evident from a young age as we acquire new abilities and solve problems by imitating others or following natural language instructions. The research community is actively pursuing the development of interactive "embodied agents" that can engage in natural conversations with humans and assist them with real-world tasks. These agents must possess the ability to promptly request feedback in case communication breaks down or instructions are unclear. Additionally, they must demonstrate proficiency in learning new vocabulary specific to a given domain. In this paper, we made the following contributions: (1) a crowd-sourcing tool for collecting grounded language instructions; (2) the largest dataset of grounded language instructions; and (3) several state-of-the-art baselines. These contributions are suitable as a foundation for further research.
Experiments on online marketplaces and social networks suffer from interference, where the outcome of a unit is impacted by the treatment status of other units. We propose a framework for modeling interference using a ubiquitous deployment mechanism for experiments, staggered roll-out designs, which slowly increase the fraction of units exposed to the treatment to mitigate any unanticipated adverse side effects. Our main idea is to leverage the temporal variations in treatment assignments introduced by roll-outs to model the interference structure. We first present a set of model identification conditions under which the estimation of common estimands is possible and show how these conditions are aided by roll-out designs. Since there are often multiple competing models of interference in practice, we then develop a model selection method that evaluates models based on their ability to explain outcome variation observed along the roll-out. Through simulations, we show that our heuristic model selection method, Leave-One-Period-Out, outperforms other baselines. We conclude with a set of considerations, robustness checks, and potential limitations for practitioners wishing to use our framework.
The problem of Multiple Object Tracking (MOT) consists in following the trajectory of different objects in a sequence, usually a video. In recent years, with the rise of Deep Learning, the algorithms that provide a solution to this problem have benefited from the representational power of deep models. This paper provides a comprehensive survey on works that employ Deep Learning models to solve the task of MOT on single-camera videos. Four main steps in MOT algorithms are identified, and an in-depth review of how Deep Learning was employed in each one of these stages is presented. A complete experimental comparison of the presented works on the three MOTChallenge datasets is also provided, identifying a number of similarities among the top-performing methods and presenting some possible future research directions.
To address the sparsity and cold start problem of collaborative filtering, researchers usually make use of side information, such as social networks or item attributes, to improve recommendation performance. This paper considers the knowledge graph as the source of side information. To address the limitations of existing embedding-based and path-based methods for knowledge-graph-aware recommendation, we propose Ripple Network, an end-to-end framework that naturally incorporates the knowledge graph into recommender systems. Similar to actual ripples propagating on the surface of water, Ripple Network stimulates the propagation of user preferences over the set of knowledge entities by automatically and iteratively extending a user's potential interests along links in the knowledge graph. The multiple "ripples" activated by a user's historically clicked items are thus superposed to form the preference distribution of the user with respect to a candidate item, which could be used for predicting the final clicking probability. Through extensive experiments on real-world datasets, we demonstrate that Ripple Network achieves substantial gains in a variety of scenarios, including movie, book and news recommendation, over several state-of-the-art baselines.
Most previous event extraction studies have relied heavily on features derived from annotated event mentions, thus cannot be applied to new event types without annotation effort. In this work, we take a fresh look at event extraction and model it as a grounding problem. We design a transferable neural architecture, mapping event mentions and types jointly into a shared semantic space using structural and compositional neural networks, where the type of each event mention can be determined by the closest of all candidate types . By leveraging (1)~available manual annotations for a small set of existing event types and (2)~existing event ontologies, our framework applies to new event types without requiring additional annotation. Experiments on both existing event types (e.g., ACE, ERE) and new event types (e.g., FrameNet) demonstrate the effectiveness of our approach. \textit{Without any manual annotations} for 23 new event types, our zero-shot framework achieved performance comparable to a state-of-the-art supervised model which is trained from the annotations of 500 event mentions.