Robotics has dramatically increased our ability to gather data about our environments. This is an opportune time for the robotics and algorithms community to come together to contribute novel solutions to pressing environmental monitoring problems. In order to do so, it is useful to consider a taxonomy of problems and methods in this realm. We present the first comprehensive summary of decision theoretic approaches that are enabling efficient sampling of various kinds of environmental processes. Representations for different kinds of environments are explored, followed by a discussion of tasks of interest such as learning, localization, or monitoring. Finally, various algorithms to carry out these tasks are presented, along with a few illustrative prior results from the community.
Speaker diarization is well studied for constrained audios but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset, and propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers. The visual-centric sub-system leverages facial attributes and lip-audio synchrony for identity and speech activity estimation of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a large margin, with the fused audio-visual system achieving a new SOTA on the AVA-AVD benchmark.
We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and schedules for a given cluster size and degree and then identifies the appropriate topology and schedule for a given workload. Our algorithms start from small, optimal base topologies and associated communication schedules and use a set of techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient collective schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.
Unsupervised anomaly localization, which plays a critical role in industrial manufacturing, is to identify anomalous regions that deviate from patterns established exclusively from nominal samples. Recent mainstream methods focus on approximating the target feature distribution by leveraging embeddings from ImageNet models. However, a common issue in many anomaly localization methods is the lack of adaptability of the feature approximations to specific targets. Consequently, their ability to effectively identify anomalous regions relies significantly on the data coverage provided by the finite resources in a memory bank. In this paper, we propose a novel subspace-aware feature reconstruction framework for anomaly localization. To achieve adaptive feature approximation, our proposed method involves the reconstruction of the feature representation through the self-expressive model designed to learn low-dimensional subspaces. Importantly, the sparsity of the subspace representation contributes to covering feature patterns from the same subspace with fewer resources, leading to a reduction in the memory bank. Extensive experiments across three industrial benchmark datasets demonstrate that our approach achieves competitive anomaly localization performance compared to state-of-the-art methods by adaptively reconstructing target features with a small number of samples.
The performance of Hamiltonian Monte Carlo crucially depends on its parameters, in particular the integration timestep and the number of integration steps. We present an adaptive general-purpose framework to automatically tune these parameters based on a loss function which promotes the fast exploration of phase-space. For this, we make use of a fully-differentiable set-up and use backpropagation for optimization. An attention-like loss is defined which allows for the gradient driven learning of the distribution of integration steps. We also highlight the importance of jittering for a smooth loss-surface. Our approach is demonstrated for the one-dimensional harmonic oscillator and alanine dipeptide, a small protein common as a test-case for simulation methods. We find a good correspondence between our loss and the autocorrelation times, resulting in well-tuned parameters for Hamiltonian Monte Carlo.
Rationalization is a self-explaining framework for NLP models. Conventional work typically uses the maximum mutual information (MMI) criterion to find the rationale that is most indicative of the target label. However, this criterion can be influenced by spurious features that correlate with the causal rationale or the target label. Instead of attempting to rectify the issues of the MMI criterion, we propose a novel criterion to uncover the causal rationale, termed the Minimum Conditional Dependence (MCD) criterion, which is grounded on our finding that the non-causal features and the target label are \emph{d-separated} by the causal rationale. By minimizing the dependence between the unselected parts of the input and the target label conditioned on the selected rationale candidate, all the causes of the label are compelled to be selected. In this study, we employ a simple and practical measure of dependence, specifically the KL-divergence, to validate our proposed MCD criterion. Empirically, we demonstrate that MCD improves the F1 score by up to $13.7\%$ compared to previous state-of-the-art MMI-based methods. Our code is available at: \url{//github.com/jugechengzi/Rationalization-MCD}.
Nonlinear model predictive control (NMPC) is typically restricted to short, finite horizons to limit the computational burden of online optimization. This makes a global planner necessary to avoid local minima when using NMPC for navigation in complex environments. For this reason, the performance of NMPC approaches are often limited by that of the global planner. While control policies trained with reinforcement learning (RL) can theoretically learn to avoid such local minima, they are usually unable to guarantee enforcement of general state constraints. In this paper, we augment a sampling-based stochastic NMPC (SNMPC) approach with an RL trained perception-informed value function. This allows the system to avoid observable local minima in the environment by reasoning about perception information beyond the finite planning horizon. By using Probably Approximately Correct NMPC (PAC-NMPC) as our base controller, we are also able to generate statistical guarantees of performance and safety. We demonstrate our approach in simulation and on hardware using a 1/10th scale rally car with lidar.
Graph-based collaborative filtering has emerged as a powerful paradigm for delivering personalized recommendations. Despite their demonstrated effectiveness, these methods often neglect the underlying intents of users, which constitute a pivotal facet of comprehensive user interests. Consequently, a series of approaches have arisen to tackle this limitation by introducing independent intent representations. However, these approaches fail to capture the intricate relationships between intents of different users and the compatibility between user intents and item properties. To remedy the above issues, we propose a novel method, named uniformly co-clustered intent modeling. Specifically, we devise a uniformly contrastive intent modeling module to bring together the embeddings of users with similar intents and items with similar properties. This module aims to model the nuanced relations between intents of different users and properties of different items, especially those unreachable to each other on the user-item graph. To model the compatibility between user intents and item properties, we design the user-item co-clustering module, maximizing the mutual information of co-clusters of users and items. This approach is substantiated through theoretical validation, establishing its efficacy in modeling compatibility to enhance the mutual information between user and item representations. Comprehensive experiments on various real-world datasets verify the effectiveness of the proposed framework.
To alleviate the expensive human labeling, semi-supervised semantic segmentation employs a few labeled images and an abundant of unlabeled images to predict the pixel-level label map with the same size. Previous methods often adopt co-training using two convolutional networks with the same architecture but different initialization, which fails to capture the sufficiently diverse features. This motivates us to use tri-training and develop the triple-view encoder to utilize the encoders with different architectures to derive diverse features, and exploit the knowledge distillation skill to learn the complementary semantics among these encoders. Moreover, existing methods simply concatenate the features from both encoder and decoder, resulting in redundant features that require large memory cost. This inspires us to devise a dual-frequency decoder that selects those important features by projecting the features from the spatial domain to the frequency domain, where the dual-frequency channel attention mechanism is introduced to model the feature importance. Therefore, we propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation, including the triple-view encoder and the dual-frequency decoder. Extensive experiments were conducted on two benchmarks, \ie, Pascal VOC 2012 and Cityscapes, whose results verify the superiority of the proposed method with a good tradeoff between precision and inference speed.
Multi-relation Question Answering is a challenging task, due to the requirement of elaborated analysis on questions and reasoning over multiple fact triples in knowledge base. In this paper, we present a novel model called Interpretable Reasoning Network that employs an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decides which part of an input question should be analyzed at each hop; predicts a relation that corresponds to the current parsed results; utilizes the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model can offer traceable and observable intermediate predictions for reasoning analysis and failure diagnosis.
We consider the task of weakly supervised one-shot detection. In this task, we attempt to perform a detection task over a set of unseen classes, when training only using weak binary labels that indicate the existence of a class instance in a given example. The model is conditioned on a single exemplar of an unseen class and a target example that may or may not contain an instance of the same class as the exemplar. A similarity map is computed by using a Siamese neural network to map the exemplar and regions of the target example to a latent representation space and then computing cosine similarity scores between representations. An attention mechanism weights different regions in the target example, and enables learning of the one-shot detection task using the weaker labels alone. The model can be applied to detection tasks from different domains, including computer vision object detection. We evaluate our attention Siamese networks on a one-shot detection task from the audio domain, where it detects audio keywords in spoken utterances. Our model considerably outperforms a baseline approach and yields a 42.6% average precision for detection across 10 unseen classes. Moreover, architectural developments from computer vision object detection models such as a region proposal network can be incorporated into the model architecture, and results show that performance is expected to improve by doing so.