Contrastive explanations, where one decision is explained in contrast to another, are supposed to be closer to how humans explain a decision than non-contrastive explanations, where the decision is not necessarily referenced to an alternative. This claim has never been empirically validated. We analyze four English text-classification datasets (SST2, DynaSent, BIOS and DBpedia-Animals). We fine-tune and extract explanations from three different models (RoBERTa, GTP-2, and T5), each in three different sizes and apply three post-hoc explainability methods (LRP, GradientxInput, GradNorm). We furthermore collect and release human rationale annotations for a subset of 100 samples from the BIOS dataset for contrastive and non-contrastive settings. A cross-comparison between model-based rationales and human annotations, both in contrastive and non-contrastive settings, yields a high agreement between the two settings for models as well as for humans. Moreover, model-based explanations computed in both settings align equally well with human rationales. Thus, we empirically find that humans do not necessarily explain in a contrastive manner.9 pages, long paper at ACL 2022 proceedings.
Dynamic Bayesian Networks (DBNs), renowned for their interpretability, have become increasingly vital in representing complex stochastic processes in various domains such as gene expression analysis, healthcare, and traffic prediction. Structure learning of DBNs from data is challenging, particularly for datasets with thousands of variables. Most current algorithms for DBN structure learning are adaptations from those used in static Bayesian Networks (BNs), and are typically focused on small-scale problems. In order to solve large-scale problems while taking full advantage of existing algorithms, this paper introduces a novel divide-and-conquer strategy, originally developed for static BNs, and adapts it for large-scale DBN structure learning. In this work, we specifically concentrate on 2 Time-sliced Bayesian Networks (2-TBNs), a special class of DBNs. Furthermore, we leverage the prior knowledge of 2-TBNs to enhance the performance of the strategy we introduce. Our approach significantly improves the scalability and accuracy of 2-TBN structure learning. Experimental results demonstrate the effectiveness of our method, showing substantial improvements over existing algorithms in both computational efficiency and structure learning accuracy. On problem instances with more than 1,000 variables, our approach improves two accuracy metrics by 74.45% and 110.94% on average , respectively, while reducing runtime by 93.65% on average.
Generalizable manipulation of articulated objects remains a challenging problem in many real-world scenarios, given the diverse object structures, functionalities, and goals. In these tasks, both semantic interpretations and physical plausibilities are crucial for a policy to succeed. To address this problem, we propose SAGE, a novel framework that bridges the understanding of semantic and actionable parts of articulated objects to achieve generalizable manipulation under language instructions. Given a manipulation goal specified by natural language, an instruction interpreter with Large Language Models (LLMs) first translates them into programmatic actions on the object's semantic parts. This process also involves a scene context parser for understanding the visual inputs, which is designed to generate scene descriptions with both rich information and accurate interaction-related facts by joining the forces of generalist Visual-Language Models (VLMs) and domain-specialist part perception models. To further convert the action programs into executable policies, a part grounding module then maps the object semantic parts suggested by the instruction interpreter into so-called Generalizable Actionable Parts (GAParts). Finally, an interactive feedback module is incorporated to respond to failures, which greatly increases the robustness of the overall framework. Experiments both in simulation environments and on real robots show that our framework can handle a large variety of articulated objects with diverse language-instructed goals. We also provide a new benchmark for language-guided articulated-object manipulation in realistic scenarios.
As discussions around 6G begin, it is important to carefully quantify the spectral efficiency gains actually realized by deployed 5G networks as compared to 4G through various enhancements such as higher modulation, beamforming, and MIMO. This will inform the design of future cellular systems, especially in the mid-bands, which provide a good balance between bandwidth and propagation. Similar to 4G, 5G also utilizes low-band (<1 GHz) and mid-band spectrum (1 to 6 GHz), and hence comparing the performance of 4G and 5G in these bands will provide insights into how further improvements can be attained. In this work, we address a crucial question: is the performance boost in 5G compared to 4G primarily a result of increased bandwidth, or do the other enhancements play significant roles, and if so, under what circumstances? Hence, we conduct city-wide measurements of 4G and 5G cellular networks deployed in low- and mid-bands in Chicago and Minneapolis, and carefully quantify the contributions of different aspects of 5G advancements to its improved throughput performance. Our analyses show that (i) compared to 4G, the throughput improvement in 5G today is mainly influenced by the wider channel bandwidth, both from single channels and channel aggregation, (ii) in addition to wider channels, improved 5G throughput requires better signal conditions, which can be delivered by denser deployment and/or use of beamforming in mid-bands, (iii) the channel rank in real-world environments rarely supports the full 4 layers of 4x4 MIMO and (iv) advanced features such as MU-MIMO and higher order modulation such as 1024-QAM have yet to be widely deployed. These observations and conclusions lead one to consider designing the next generation of cellular systems to have wider channels, perhaps with improved channel aggregation, dense deployment with more beams.
We prove that the well-known (strong) fully-concurrent bisimilarity and the novel i-causal-net bisimilarity, which is a sligtlhy coarser variant of causal-net bisimilarity, are decidable for finite bounded Petri nets. The proofs are based on a generalization of the ordered marking proof technique that Vogler used to demonstrate that (strong) fully-concurrent bisimilarity (or, equivalently, history-preserving bisimilarity) is decidable on finite safe nets.
Magnetic resonance imaging (MRI) is commonly used for brain tumor segmentation, which is critical for patient evaluation and treatment planning. To reduce the labor and expertise required for labeling, weakly-supervised semantic segmentation (WSSS) methods with class activation mapping (CAM) have been proposed. However, existing CAM methods suffer from low resolution due to strided convolution and pooling layers, resulting in inaccurate predictions. In this study, we propose a novel CAM method, Attentive Multiple-Exit CAM (AME-CAM), that extracts activation maps from multiple resolutions to hierarchically aggregate and improve prediction accuracy. We evaluate our method on the BraTS 2021 dataset and show that it outperforms state-of-the-art methods.
We examine the linear regression problem in a challenging high-dimensional setting with correlated predictors to explain and predict relevant quantities, with explicitly allowing the regression coefficient to vary from sparse to dense. Most classical high-dimensional regression estimators require some degree of sparsity. We discuss the more recent concepts of variable screening and random projection as computationally fast dimension reduction tools, and propose a new random projection matrix tailored to the linear regression problem with a theoretical bound on the gain in expected prediction error over conventional random projections. Around this new random projection, we built the Sparse Projected Averaged Regression (SPAR) method combining probabilistic variable screening steps with the random projection steps to obtain an ensemble of small linear models. In difference to existing methods, we introduce a thresholding parameter to obtain some degree of sparsity. In extensive simulations and two real data applications we guide through the elements of this method and compare prediction and variable selection performance to various competitors. For prediction, our method performs at least as good as the best competitors in most settings with a high number of truly active variables, while variable selection remains a hard task for all methods in high dimensions.
Graph Neural Networks (GNNs) have recently become increasingly popular due to their ability to learn complex systems of relations or interactions arising in a broad spectrum of problems ranging from biology and particle physics to social networks and recommendation systems. Despite the plethora of different models for deep learning on graphs, few approaches have been proposed thus far for dealing with graphs that present some sort of dynamic nature (e.g. evolving features or connectivity over time). In this paper, we present Temporal Graph Networks (TGNs), a generic, efficient framework for deep learning on dynamic graphs represented as sequences of timed events. Thanks to a novel combination of memory modules and graph-based operators, TGNs are able to significantly outperform previous approaches being at the same time more computationally efficient. We furthermore show that several previous models for learning on dynamic graphs can be cast as specific instances of our framework. We perform a detailed ablation study of different components of our framework and devise the best configuration that achieves state-of-the-art performance on several transductive and inductive prediction tasks for dynamic graphs.
Object detection typically assumes that training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch will lead to a significant performance drop. In this work, we aim to improve the cross-domain robustness of object detection. We tackle the domain shift on two levels: 1) the image-level shift, such as image style, illumination, etc, and 2) the instance-level shift, such as object appearance, size, etc. We build our approach based on the recent state-of-the-art Faster R-CNN model, and design two domain adaptation components, on image level and instance level, to reduce the domain discrepancy. The two domain adaptation components are based on H-divergence theory, and are implemented by learning a domain classifier in adversarial training manner. The domain classifiers on different levels are further reinforced with a consistency regularization to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We evaluate our newly proposed approach using multiple datasets including Cityscapes, KITTI, SIM10K, etc. The results demonstrate the effectiveness of our proposed approach for robust object detection in various domain shift scenarios.
While existing machine learning models have achieved great success for sentiment classification, they typically do not explicitly capture sentiment-oriented word interaction, which can lead to poor results for fine-grained analysis at the snippet level (a phrase or sentence). Factorization Machine provides a possible approach to learning element-wise interaction for recommender systems, but they are not directly applicable to our task due to the inability to model contexts and word sequences. In this work, we develop two Position-aware Factorization Machines which consider word interaction, context and position information. Such information is jointly encoded in a set of sentiment-oriented word interaction vectors. Compared to traditional word embeddings, SWI vectors explicitly capture sentiment-oriented word interaction and simplify the parameter learning. Experimental results show that while they have comparable performance with state-of-the-art methods for document-level classification, they benefit the snippet/sentence-level sentiment analysis.
While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions, and fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on the ImageNet classification task has been remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new Full Reference Image Quality Assessment (FR-IQA) dataset of perceptual human judgments, orders of magnitude larger than previous datasets. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by huge margins. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.