This paper proposes EyeNet, a novel semantic segmentation network for point clouds that addresses the critical yet often overlooked parameter of coverage area size. Inspired by human peripheral vision, EyeNet overcomes the limitations of conventional networks by introducing a simple yet efficient multi-contour input and a parallel processing network with connection blocks between the parallel streams. The proposed approach effectively addresses the challenges of dense point clouds, as demonstrated by our ablation studies and by state-of-the-art performance on large-scale outdoor datasets.
This paper focuses on the optimal implementation of a Mean Variance Estimation network (MVE network) (Nix and Weigend, 1994). This type of network is often used as a building block for uncertainty estimation methods in a regression setting, for instance Concrete Dropout (Gal et al., 2017) and Deep Ensembles (Lakshminarayanan et al., 2017). Specifically, an MVE network assumes that the data are produced from a normal distribution with a mean function and a variance function. The MVE network outputs mean and variance estimates and optimizes the network parameters by minimizing the negative log-likelihood. In our paper, we present two significant insights. First, the convergence difficulties reported in recent work can be prevented relatively easily by following the simple yet often overlooked recommendation of the original authors that a warm-up period should be used, during which only the mean is optimized with a fixed variance. We demonstrate the effectiveness of this step through experimentation, highlighting that it should be standard practice. As a side note, we examine whether, after the warm-up, it is beneficial to fix the mean while optimizing the variance or to optimize both simultaneously; here, we do not observe a substantial difference. Second, we introduce a novel improvement to the MVE network: separate regularization of the mean and the variance estimates. We demonstrate, both on toy examples and on a number of benchmark UCI regression datasets, that following the original recommendations together with the novel separate regularization can lead to significant improvements.
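To make the warm-up recommendation concrete, the following is a minimal PyTorch sketch of an MVE network trained with a Gaussian negative log-likelihood, where the variance is held fixed at one during the warm-up epochs; the architecture, layer sizes, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of an MVE network with a warm-up period, assuming PyTorch.
# The architecture, layer sizes, and hyperparameters are illustrative only.
import torch
import torch.nn as nn

class MVENet(nn.Module):
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.var_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.trunk(x)
        mean = self.mean_head(h)
        var = nn.functional.softplus(self.var_head(h)) + 1e-6  # keep variance positive
        return mean, var

def nll(mean, var, y):
    # Gaussian negative log-likelihood, up to an additive constant
    return (0.5 * torch.log(var) + 0.5 * (y - mean) ** 2 / var).mean()

def train(model, loader, warmup_epochs=50, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for x, y in loader:
            mean, var = model(x)
            if epoch < warmup_epochs:
                # Warm-up: fixed unit variance, so the loss reduces to 0.5 * MSE
                loss = nll(mean, torch.ones_like(var), y)
            else:
                loss = nll(mean, var, y)  # joint optimization of mean and variance
            opt.zero_grad()
            loss.backward()
            opt.step()
```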
This paper introduces an approach that combines the language reasoning capabilities of large language models (LLMs) with the benefits of local training to tackle complex, domain-specific tasks. Specifically, the authors demonstrate their approach by extracting structured condition codes from pathology reports. The proposed approach utilizes local LLMs, which can be fine-tuned to respond to specific generative instructions and provide structured outputs. The authors collected a dataset of over 150k uncurated surgical pathology reports containing gross descriptions, final diagnoses, and condition codes. They trained different model architectures, including LLaMA, BERT, and Longformer, and evaluated their performance. The results show that the LLaMA-based models significantly outperform BERT-style models across all evaluated metrics, even at extremely reduced precision. The LLaMA models performed especially well on large datasets, demonstrating their ability to handle complex, multi-label tasks. Overall, this work presents an effective approach for utilizing LLMs to perform domain-specific tasks on accessible hardware, with potential applications in the medical domain, where complex data extraction and classification are required.
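As an illustration of the structured-output setup described above, here is a hypothetical sketch of how one pathology report might be formatted as an instruction-style training example for a local generative LLM; the field names, prompt template, and code serialization are assumptions, not the authors' actual format.

```python
# Hypothetical sketch of turning one pathology report into an instruction-style
# training example for a local generative LLM. The field names, prompt template,
# and code serialization are illustrative assumptions.
def format_example(report: dict) -> dict:
    prompt = (
        "### Instruction:\n"
        "Extract the structured condition codes from the pathology report below.\n\n"
        "### Report:\n"
        f"Gross description: {report['gross_description']}\n"
        f"Final diagnosis: {report['final_diagnosis']}\n\n"
        "### Response:\n"
    )
    # Multi-label target: the codes serialized as a single structured string
    response = ", ".join(sorted(report["condition_codes"]))
    return {"text": prompt + response}

# Usage with made-up report content:
example = format_example({
    "gross_description": "Received in formalin, a 2.1 cm skin ellipse ...",
    "final_diagnosis": "Basal cell carcinoma, margins uninvolved.",
    "condition_codes": ["C44.91"],
})
```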
With the advent of the big data era, the problem of data quality is becoming more critical. Among many factors, data with missing values is one primary issue, and developing effective imputation models is thus a key topic in the research community. Recently, a major research direction has been to employ neural network models, such as self-organizing maps or autoencoders, to fill in missing values. However, these classical methods can hardly discover interrelated features and common features simultaneously among data attributes. In particular, classical autoencoders often learn invalid constant mappings, which dramatically hurts the filling performance. To solve these problems, we propose a missing-value-filling model based on a feature-fusion-enhanced autoencoder. We first incorporate into the autoencoder a hidden layer that consists of de-tracking neurons and radial basis function neurons, which enhances the ability to learn interrelated features and common features. In addition, we develop a missing-value-filling strategy based on dynamic clustering, which is incorporated into an iterative optimization process. This design enhances the multi-dimensional feature fusion ability and thus improves the dynamic collaborative missing-value-filling performance. The effectiveness of the proposed model is validated by extensive experiments against a variety of baseline methods on thirteen datasets.
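To illustrate the general iterative fill-and-refit idea underlying such approaches, the following is a minimal PyTorch sketch that alternately fits an autoencoder to the observed entries and refills the missing entries with the reconstruction; the de-tracking neurons, RBF neurons, and dynamic clustering of the proposed model are omitted, and the architecture and hyperparameters are illustrative assumptions.

```python
# Illustrative PyTorch sketch of iterative autoencoder imputation only; the
# de-tracking neurons, RBF neurons, and dynamic clustering of the proposed
# model are omitted here.
import torch
import torch.nn as nn

def iterative_impute(X, mask, iters=5, epochs=200, lr=1e-3):
    # X: (n, d) float tensor with missing entries pre-filled (e.g., column means)
    # mask: (n, d) boolean tensor, True where the entry is observed
    d = X.shape[1]
    ae = nn.Sequential(nn.Linear(d, d // 2), nn.Tanh(), nn.Linear(d // 2, d))
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(iters):
        for _ in range(epochs):
            recon = ae(X)
            loss = ((recon - X)[mask] ** 2).mean()  # fit observed entries only
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            X = torch.where(mask, X, ae(X))  # refill the missing entries
    return X
```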
Long-term tracking is an active topic in computer vision. Competitive models are presented every year, showing steady performance growth, mainly measured under standardized protocols such as Visual Object Tracking (VOT) and the Object Tracking Benchmark (OTB). Over the last few years, tracker-fusion strategies have been applied to overcome the well-known re-detection problem, proving to be an important breakthrough. Following this approach, this work aims to generalize the fusion concept to an arbitrary number of baseline trackers in the pipeline, leveraging a learning phase to better understand how their outcomes correlate with each other, even when no target is present. We also provide evidence for a model- and data-independence conjecture: the approach yields a recall of 0.738 on the LTB-50 dataset when learning from VOT-LT2022, and 0.619 when the two datasets are reversed. In both cases, the results are strongly competitive with the state of the art, and our recall ranks first.
Driving scene understanding aims to obtain comprehensive scene information from sensor data and to provide a basis for downstream tasks, which is indispensable for the safety of self-driving vehicles. Specific perception tasks, such as object detection and scene graph generation, are commonly used. However, the results of these tasks amount only to a characterization obtained by sampling from high-dimensional scene features and are insufficient to represent the scenario. In addition, the goal of perception tasks is inconsistent with human driving, which focuses only on what may affect the ego-trajectory. Therefore, we propose an end-to-end Interpretable Implicit Driving Scene Understanding (II-DSU) model that extracts implicit high-dimensional scene features as scene understanding results, guided by a planning module, and validates the plausibility of the scene understanding through auxiliary perception tasks used for visualization. Experimental results on CARLA benchmarks show that our approach achieves a new state of the art and is able to obtain scene features that embody richer scene information relevant to driving, enabling superior performance in downstream planning.
Robot teleoperation has achieved great success in various situations, including chemical pollution rescue, disaster relief, and long-distance manipulation. In this article, we propose a virtual reality (VR) based robot teleoperation system that achieves more efficient and natural human interaction in different scenes. A user-friendly VR interface is designed to help users interact with a desktop scene using their hands efficiently and intuitively. To improve the user experience and reduce workload, we simulate the manipulation in a physics engine, giving users a preview of the post-manipulation scene in the virtual environment before execution. We conduct experiments with different users and compare our system with a direct control method across several teleoperation tasks. The user study demonstrates that the proposed system enables users to perform operations more intuitively, with a lighter mental workload. Users can perform pick-and-place and object-stacking tasks in a considerably short time, even as beginners. Our code is available at //github.com/lingxiaomeng/VR_Teleoperation_Gen3.
The complex scenario of ultrasound images, in which adjacent tissues (i.e., background) share similar intensity with, and even contain richer texture patterns than, the lesion region (i.e., foreground), poses a unique challenge for accurate lesion segmentation. This work presents a decomposition-coupling network, called DC-Net, to deal with this challenge in a (foreground-background) saliency map disentanglement-fusion manner. The DC-Net consists of decomposition and coupling subnets: the former preliminarily disentangles the original image into foreground and background saliency maps, and the latter then performs accurate segmentation with the assistance of saliency prior fusion. The coupling subnet involves three fusion strategies: 1) regional feature aggregation (via a differentiable context pooling operator in the encoder) to adaptively preserve local contextual details with a larger receptive field during dimension reduction; 2) relation-aware representation fusion (via a cross-correlation fusion module in the decoder) to efficiently fuse low-level visual characteristics and high-level semantic features during resolution restoration; and 3) dependency-aware prior incorporation (via a coupler) to reinforce the foreground-salient representation with complementary information derived from the background representation. Furthermore, a harmonic loss function is introduced to encourage the network to focus more attention on low-confidence and hard samples. The proposed method is evaluated on two ultrasound lesion segmentation tasks, demonstrating remarkable performance improvements over existing state-of-the-art methods.
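As a speculative sketch of the decomposition-coupling structure (not the authors' DC-Net, whose context pooling, cross-correlation fusion, coupler, and harmonic loss are omitted here), the following PyTorch example predicts foreground/background saliency maps and then feeds them, together with the image, into a coupling segmentation branch.

```python
# Speculative PyTorch sketch of a decomposition-coupling structure: one branch
# predicts foreground/background saliency maps, and a second branch segments
# from the image plus both saliency priors. All layer choices are assumptions;
# the actual DC-Net modules are far more elaborate.
import torch
import torch.nn as nn

class DecoupleSegNet(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        # Decomposition subnet: two saliency maps from a single-channel image
        self.decompose = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2, 1))  # channel 0: foreground, channel 1: background
        # Coupling subnet: segmentation from image + foreground/background priors
        self.couple = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 1))

    def forward(self, x):
        sal = torch.sigmoid(self.decompose(x))  # (B, 2, H, W) saliency maps
        fused = torch.cat([x, sal], dim=1)      # concatenate image with priors
        return torch.sigmoid(self.couple(fused)), sal
```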
We consider the problem of explaining the predictions of graph neural networks (GNNs), which are otherwise treated as black boxes. Existing methods invariably focus on explaining the importance of graph nodes or edges but ignore the substructures of graphs, which are more intuitive and human-intelligible. In this work, we propose a novel method, known as SubgraphX, to explain GNNs by identifying important subgraphs. Given a trained GNN model and an input graph, SubgraphX explains the model's predictions by efficiently exploring different subgraphs with Monte Carlo tree search. To make the tree search more effective, we propose to use Shapley values as a measure of subgraph importance, which can also capture the interactions among different subgraphs. To expedite computation, we propose efficient approximation schemes for computing Shapley values on graph data. Our work represents the first attempt to explain GNNs by identifying subgraphs explicitly and directly. Experimental results show that SubgraphX achieves significantly improved explanations while keeping computation at a reasonable level.
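To make the Shapley-based subgraph scoring concrete, here is a minimal Monte Carlo sketch that treats a candidate subgraph as a single player and averages its marginal contribution over random coalitions of the remaining nodes; `model_score` is an assumed black-box function (e.g., the GNN's prediction on a node-masked graph), and the sketch omits the paper's efficiency-oriented approximations and the tree search itself.

```python
# Hedged Monte Carlo sketch of Shapley-style subgraph scoring: the subgraph is
# treated as one player, and its marginal contribution is averaged over random
# coalitions of the remaining nodes. `model_score` is an assumed black-box.
import random

def shapley_subgraph(model_score, all_nodes, subgraph_nodes, samples=100):
    others = [n for n in all_nodes if n not in subgraph_nodes]
    total = 0.0
    for _ in range(samples):
        k = random.randint(0, len(others))
        coalition = set(random.sample(others, k))
        with_sub = model_score(coalition | set(subgraph_nodes))
        without_sub = model_score(coalition)
        total += with_sub - without_sub  # marginal contribution of the subgraph
    return total / samples
```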
In this paper, we propose a novel Feature Decomposition and Reconstruction Learning (FDRL) method for effective facial expression recognition. We view expression information as the combination of shared information (expression similarities) across different expressions and unique information (expression-specific variations) for each expression. More specifically, FDRL consists of two crucial networks: a Feature Decomposition Network (FDN) and a Feature Reconstruction Network (FRN). FDN first decomposes the basic features extracted from a backbone network into a set of facial action-aware latent features to model expression similarities. FRN then captures the intra-feature and inter-feature relationships of the latent features to characterize expression-specific variations and reconstructs the expression feature. To this end, two modules, an intra-feature relation modeling module and an inter-feature relation modeling module, are developed in FRN. Experimental results on both in-the-lab databases (including CK+, MMI, and Oulu-CASIA) and in-the-wild databases (including RAF-DB and SFEW) show that the proposed FDRL method consistently achieves higher recognition accuracy than several state-of-the-art methods. This clearly highlights the benefit of feature decomposition and reconstruction for classifying expressions.
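The following is a minimal, assumption-laden PyTorch sketch of the decompose-then-reconstruct idea: a backbone feature is projected into several latent features, each latent feature receives a learned weight, and the weighted features are recombined. The module names, dimensions, and weighting scheme are illustrative, not the exact FDN/FRN design.

```python
# Minimal sketch of decompose-then-reconstruct feature learning. The layer
# choices and dimensions are illustrative assumptions, not the exact FDN/FRN.
import torch
import torch.nn as nn

class DecomposeReconstruct(nn.Module):
    def __init__(self, feat_dim=512, num_latent=8, latent_dim=64):
        super().__init__()
        # FDN-like step: project the backbone feature into several latent features
        self.decompose = nn.ModuleList(
            [nn.Linear(feat_dim, latent_dim) for _ in range(num_latent)])
        # FRN-like step: predict a weight per latent feature, then recombine
        self.weighter = nn.Linear(latent_dim, 1)
        self.reconstruct = nn.Linear(num_latent * latent_dim, feat_dim)

    def forward(self, x):
        # x: (B, feat_dim) backbone feature
        latents = [torch.relu(proj(x)) for proj in self.decompose]
        weights = [torch.sigmoid(self.weighter(z)) for z in latents]
        weighted = torch.cat([w * z for w, z in zip(weights, latents)], dim=-1)
        return self.reconstruct(weighted)  # reconstructed expression feature
```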
Multi-relation question answering is a challenging task, as it requires elaborate analysis of questions and reasoning over multiple fact triples in a knowledge base. In this paper, we present a novel model, the Interpretable Reasoning Network, that employs an interpretable, hop-by-hop reasoning process for question answering. At each hop, the model dynamically decides which part of the input question should be analyzed; predicts a relation that corresponds to the currently parsed results; uses the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model offers traceable and observable intermediate predictions for reasoning analysis and failure diagnosis.
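To illustrate the hop-by-hop mechanism, the following is an illustrative PyTorch sketch of a single reasoning hop: the model attends to part of the question conditioned on the current state, predicts a relation, and updates the state with the predicted relation's embedding. All module choices and dimensions are assumptions rather than the paper's exact architecture.

```python
# Illustrative sketch of one reasoning hop; modules and dimensions are assumed.
import torch
import torch.nn as nn

class ReasoningHop(nn.Module):
    def __init__(self, dim, num_relations):
        super().__init__()
        self.rel_classifier = nn.Linear(dim, num_relations)
        self.rel_embed = nn.Embedding(num_relations, dim)
        self.state_update = nn.GRUCell(dim, dim)

    def forward(self, question_tokens, state):
        # question_tokens: (B, T, dim); state: (B, dim)
        # Attend over question tokens conditioned on the current reasoning state
        scores = torch.softmax(question_tokens @ state.unsqueeze(-1), dim=1)
        focus = (scores * question_tokens).sum(dim=1)  # attended question part
        rel_logits = self.rel_classifier(focus)
        # Hard argmax for simplicity; a soft mixture of relation embeddings
        # would keep this step differentiable end to end.
        rel = self.rel_embed(rel_logits.argmax(dim=-1))
        new_state = self.state_update(rel, state)
        return rel_logits, new_state
```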