Counterfactual explanations constitute among the most popular methods for analyzing black-box systems since they can recommend cost-efficient and actionable changes to the input of a system to obtain the desired system output. While most of the existing counterfactual methods explain a single instance, several real-world problems, such as customer satisfaction, require the identification of a single counterfactual that can satisfy multiple instances (e.g. customers) simultaneously. To address this limitation, in this work, we propose a flexible two-stage algorithm for finding groups of instances and computing cost-efficient multi-instance counterfactual explanations. The paper presents the algorithm and its performance against popular alternatives through a comparative evaluation.
We focus on the problem of how we can enable a robot to collaborate seamlessly with a human partner, specifically in scenarios where preexisting data is sparse. Much prior work in human-robot collaboration uses observational models of humans (i.e. models that treat the robot purely as an observer) to choose the robot's behavior, but such models do not account for the influence the robot has on the human's actions, which may lead to inefficient interactions. We instead formulate the problem of optimally choosing a collaborative robot's behavior based on a conditional model of the human that depends on the robot's future behavior. First, we propose a novel model-based formulation of conditional behavior prediction that allows the robot to infer the human's intentions based on its future plan in data-sparse environments. We then show how to utilize a conditional model for proactive goal selection and safe trajectory generation around human collaborators. Finally, we use our proposed proactive controller in a collaborative task with real users to show that it can improve users' interactions with a robot collaborator quantitatively and qualitatively.
Leveraging complementary relationships across modalities has recently drawn a lot of attention in multimodal emotion recognition. Most of the existing approaches explored cross-attention to capture the complementary relationships across the modalities. However, the modalities may also exhibit weak complementary relationships, which may deteriorate the cross-attended features, resulting in poor multimodal feature representations. To address this problem, we propose Inconsistency-Aware Cross-Attention (IACA), which can adaptively select the most relevant features on-the-fly based on the strong or weak complementary relationships across audio and visual modalities. Specifically, we design a two-stage gating mechanism that can adaptively select the appropriate relevant features to deal with weak complementary relationships. Extensive experiments are conducted on the challenging Aff-Wild2 dataset to show the robustness of the proposed model.
How to improve the ability of scene representation is a key issue in vision-oriented decision-making applications, and current approaches usually learn task-relevant state representations within visual reinforcement learning to address this problem. While prior work typically introduces one-step behavioral similarity metrics with elements (e.g., rewards and actions) to extract task-relevant state information from observations, they often ignore the inherent dynamics relationships among the elements that are essential for learning accurate representations, which further impedes the discrimination of short-term similar task/behavior information in long-term dynamics transitions. To alleviate this problem, we propose an intrinsic dynamics-driven representation learning method with sequence models in visual reinforcement learning, namely DSR. Concretely, DSR optimizes the parameterized encoder by the state-transition dynamics of the underlying system, which prompts the latent encoding information to satisfy the state-transition process and then the state space and the noise space can be distinguished. In the implementation and to further improve the representation ability of DSR on encoding similar tasks, sequential elements' frequency domain and multi-step prediction are adopted for sequentially modeling the inherent dynamics. Finally, experimental results show that DSR has achieved significant performance improvements in the visual Distracting DMControl control tasks, especially with an average of 78.9\% over the backbone baseline. Further results indicate that it also achieves the best performances in real-world autonomous driving applications on the CARLA simulator. Moreover, qualitative analysis results validate that our method possesses the superior ability to learn generalizable scene representations on visual tasks. The source code is available at //github.com/DMU-XMU/DSR.
This paper tackles the problem of passive gaze estimation using both event and frame data. Considering the inherently different physiological structures, it is intractable to accurately estimate gaze purely based on a given state. Thus, we reformulate gaze estimation as the quantification of the state shifting from the current state to several prior registered anchor states. Specifically, we propose a two-stage learning-based gaze estimation framework that divides the whole gaze estimation process into a coarse-to-fine approach involving anchor state selection and final gaze location. Moreover, to improve the generalization ability, instead of learning a large gaze estimation network directly, we align a group of local experts with a student network, where a novel denoising distillation algorithm is introduced to utilize denoising diffusion techniques to iteratively remove inherent noise in event data. Extensive experiments demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods by a large margin of 15$\%$. The code will be publicly available at //github.com/jdjdli/Denoise_distill_EF_gazetracker.
Quadruped robots are showing impressive abilities to navigate the real world. If they are to become more integrated into society, social trust in interactions with humans will become increasingly important. Additionally, robots will need to be adaptable to different humans based on individual preferences. In this work, we study the social interaction task of learning optimal handshakes for quadruped robots based on user preferences. While maintaining balance on three legs, we parameterize handshakes with a Central Pattern Generator consisting of an amplitude, frequency, stiffness, and duration. Through 10 binary choices between handshakes, we learn a belief model to fit individual preferences for 25 different subjects. Our results show that this is an effective strategy, with 76% of users feeling happy with their identified optimal handshake parameters, and 20% feeling neutral. Moreover, compared with random and test handshakes, the optimized handshakes have significantly decreased errors in amplitude and frequency, lower Dynamic Time Warping scores, and improved energy efficiency, all of which indicate robot synchronization to the user's preferences. Video results can be found at //youtu.be/elvPv8mq1KM .
In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. Through a comprehensive review and categorization of modern ASR technologies, we first meticulously select surrogate ASRs of diverse types to generate adversarial examples. Following this, ZQ-Attack initializes the adversarial perturbation with a scaled target command audio, rendering it relatively imperceptible while maintaining effectiveness. Subsequently, to achieve high transferability of adversarial perturbations, we propose a sequential ensemble optimization algorithm, which iteratively optimizes the adversarial perturbation on each surrogate model, leveraging collaborative information from other models. We conduct extensive experiments to evaluate ZQ-Attack. In the over-the-line setting, ZQ-Attack achieves a 100% success rate of attack (SRoA) with an average signal-to-noise ratio (SNR) of 21.91dB on 4 online speech recognition services, and attains an average SRoA of 100% and SNR of 19.67dB on 16 open-source ASRs. For commercial intelligent voice control devices, ZQ-Attack also achieves a 100% SRoA with an average SNR of 15.77dB in the over-the-air setting.
We describe a class of tasks called decision-oriented dialogues, in which AI assistants must collaborate with one or more humans via natural language to help them make complex decisions. We formalize three domains in which users face everyday decisions: (1) choosing an assignment of reviewers to conference papers, (2) planning a multi-step itinerary in a city, and (3) negotiating travel plans for a group of friends. In each of these settings, AI assistants and users have disparate abilities that they must combine to arrive at the best decision: assistants can access and process large amounts of information, while users have preferences and constraints external to the system. For each task, we build a dialogue environment where agents receive a reward based on the quality of the final decision they reach. Using these environments, we collect human-human dialogues with humans playing the role of assistant. To compare how current AI assistants communicate in these settings, we present baselines using large language models in self-play. Finally, we highlight a number of challenges models face in decision-oriented dialogues, ranging from efficient communication to reasoning and optimization, and release our environments as a testbed for future modeling work.
With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset; and yields stronger domain generalization performance as well. Code is available at //github.com/KaiyangZhou/CoOp.
Many recent state-of-the-art recommender systems such as D-ATT, TransNet and DeepCoNN exploit reviews for representation learning. This paper proposes a new neural architecture for recommendation with reviews. Our model operates on a multi-hierarchical paradigm and is based on the intuition that not all reviews are created equal, i.e., only a select few are important. The importance, however, should be dynamically inferred depending on the current target. To this end, we propose a review-by-review pointer-based learning scheme that extracts important reviews, subsequently matching them in a word-by-word fashion. This enables not only the most informative reviews to be utilized for prediction but also a deeper word-level interaction. Our pointer-based method operates with a novel gumbel-softmax based pointer mechanism that enables the incorporation of discrete vectors within differentiable neural architectures. Our pointer mechanism is co-attentive in nature, learning pointers which are co-dependent on user-item relationships. Finally, we propose a multi-pointer learning scheme that learns to combine multiple views of interactions between user and item. Overall, we demonstrate the effectiveness of our proposed model via extensive experiments on \textbf{24} benchmark datasets from Amazon and Yelp. Empirical results show that our approach significantly outperforms existing state-of-the-art, with up to 19% and 71% relative improvement when compared to TransNet and DeepCoNN respectively. We study the behavior of our multi-pointer learning mechanism, shedding light on evidence aggregation patterns in review-based recommender systems.
Recommender systems play a crucial role in mitigating the problem of information overload by suggesting users' personalized items or services. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users' feedbacks. In particular, we introduce an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework LIRD for list-wide recommendations. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.