Despite the remarkable success of convolutional neural networks in various computer vision tasks, recognizing indoor scenes still presents a significant challenge due to their complex composition. Consequently, effectively leveraging semantic information in the scene has been a key issue in advancing indoor scene recognition. Unfortunately, the accuracy of semantic segmentation has limited the effectiveness of existing approaches for leveraging semantic information. As a result, many of these approaches remain at the stage of auxiliary labeling or co-occurrence statistics, with few exploring the contextual relationships between the semantic elements directly within the scene. In this paper, we propose the Semantic Region Relationship Model (SRRM), which starts directly from the semantic information inside the scene. Specifically, SRRM adopts an adaptive and efficient approach to mitigate the negative impact of semantic ambiguity and then models the semantic region relationship to perform scene recognition. Additionally, to more comprehensively exploit the information contained in the scene, we combine the proposed SRRM with the PlacesCNN module to create the Combined Semantic Region Relation Model (CSRRM), and propose a novel information combining approach to effectively explore the complementary contents between them. CSRRM significantly outperforms the SOTA methods on the MIT Indoor 67, reduced Places365 dataset, and SUN RGB-D without retraining. The code is available at: //github.com/ChuanxinSong/SRRM
Neural scene reconstruction methods have achieved impressive performance in reconstructing complex geometry and low-textured regions in large scenes. However, these methods heavily rely on 3D supervised information which is costly and time-consuming to obtain in the real world. In this paper, we propose a novel neural reconstruction method that reconstructs scenes without 3D supervision. We perform differentiable volume rendering for scene reconstruction by using accessible 2D images as supervision. We impose geometry to improve the reconstruction quality of complex geometry regions in the scenes, and impose plane constraints to improve the reconstruction quality of low-textured regions in the scenes. Specifically, we introduce a signed distance function (SDF) field, a color field, and a probability field to represent the scene, and optimize the fields under the differentiable ray marching to reconstruct the scene. Besides, we impose geometric constraints that project 3D points on the surface to similar-looking regions with similar features in different views. We also impose plane constraints to make large planes keep parallel or vertical to the wall or floor. These two constraints help to reconstruct accurate and smooth geometry structures of the scene. Without 3D supervision information, our method achieves competitive reconstruction compared with some existing methods that use 3D information as supervision on the ScanNet dataset.
Video-based human pose estimation (VHPE) is a vital yet challenging task. While deep learning methods have made significant progress for the VHPE, most approaches to this task implicitly model the long-range interaction between joints by enlarging the receptive field of the convolution. Unlike prior methods, we design a lightweight and plug-and-play joint relation extractor (JRE) to model the associative relationship between joints explicitly and automatically. The JRE takes the pseudo heatmaps of joints as input and calculates the similarity between pseudo heatmaps. In this way, the JRE flexibly learns the relationship between any two joints, allowing it to learn the rich spatial configuration of human poses. Moreover, the JRE can infer invisible joints according to the relationship between joints, which is beneficial for the model to locate occluded joints. Then, combined with temporal semantic continuity modeling, we propose a Relation-based Pose Semantics Transfer Network (RPSTN) for video-based human pose estimation. Specifically, to capture the temporal dynamics of poses, the pose semantic information of the current frame is transferred to the next with a joint relation guided pose semantics propagator (JRPSP). The proposed model can transfer the pose semantic features from the non-occluded frame to the occluded frame, making our method robust to the occlusion. Furthermore, the proposed JRE module is also suitable for image-based human pose estimation. The proposed RPSTN achieves state-of-the-art results on the video-based Penn Action dataset, Sub-JHMDB dataset, and PoseTrack2018 dataset. Moreover, the proposed JRE improves the performance of backbones on the image-based COCO2017 dataset. Code is available at //github.com/YHDang/pose-estimation.
Dynamics models learned from visual observations have shown to be effective in various robotic manipulation tasks. One of the key questions for learning such dynamics models is what scene representation to use. Prior works typically assume representation at a fixed dimension or resolution, which may be inefficient for simple tasks and ineffective for more complicated tasks. In this work, we investigate how to learn dynamic and adaptive representations at different levels of abstraction to achieve the optimal trade-off between efficiency and effectiveness. Specifically, we construct dynamic-resolution particle representations of the environment and learn a unified dynamics model using graph neural networks (GNNs) that allows continuous selection of the abstraction level. During test time, the agent can adaptively determine the optimal resolution at each model-predictive control (MPC) step. We evaluate our method in object pile manipulation, a task we commonly encounter in cooking, agriculture, manufacturing, and pharmaceutical applications. Through comprehensive evaluations both in the simulation and the real world, we show that our method achieves significantly better performance than state-of-the-art fixed-resolution baselines at the gathering, sorting, and redistribution of granular object piles made with various instances like coffee beans, almonds, corn, etc.
The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of the visual relationships in addition to the already existing challenges in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing the challenges mentioned above, especially the long-tailed distribution of relationships. This often leads to the generation of biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant (up to 10% in some cases) performance gain over existing methods highlighting its superiority in generating more unbiased scene graphs.
Self-supervised learning (SSL) based speech pre-training has attracted much attention for its capability of extracting rich representations learned from massive unlabeled data. On the other hand, the use of weakly-supervised data is less explored for speech pre-training. To fill this gap, we propose a weakly-supervised speech pre-training method based on speaker-aware speech data. It adopts a similar training procedure to the widely-used masked speech prediction based SSL framework, while incorporating additional target-speaker enrollment information as an auxiliary input. In this way, the learned representation is steered towards the target speaker even in the presence of highly overlapping interference, allowing potential applications to tasks such as target speech recognition. Our experiments on Libri2Mix and WSJ0-2mix datasets show that the proposed model achieves significantly better ASR performance compared to WavLM, the state-of-the-art SSL model with denoising capability.
High-end components that conduct complicated tasks automatically are a part of modern industrial systems. However, in order for these parts to function at the desired level, they need to be maintained by qualified experts. Solutions based on Augmented Reality (AR) have been established with the goal of raising production rates and quality while lowering maintenance costs. With the introduction of two unique interaction interfaces based on wearable targets and human face orientation, we are proposing hands-free advanced interactive solutions in this study with the goal of reducing the bias towards certain users. Using traditional devices in real time, a comparison investigation using alternative interaction interfaces is conducted. The suggested solutions are supported by various AI powered methods such as novel gravity-map based motion adjustment that is made possible by predictive deep models that reduce the bias of traditional hand- or finger-based interaction interfaces
The existing contrastive learning methods widely adopt one-hot instance discrimination as pretext task for self-supervised learning, which inevitably neglects rich inter-instance similarities among natural images, then leading to potential representation degeneration. In this paper, we propose a novel image mix method, PatchMix, for contrastive learning in Vision Transformer (ViT), to model inter-instance similarities among images. Following the nature of ViT, we randomly mix multiple images from mini-batch in patch level to construct mixed image patch sequences for ViT. Compared to the existing sample mix methods, our PatchMix can flexibly and efficiently mix more than two images and simulate more complicated similarity relations among natural images. In this manner, our contrastive framework can significantly reduce the gap between contrastive objective and ground truth in reality. Experimental results demonstrate that our proposed method significantly outperforms the previous state-of-the-art on both ImageNet-1K and CIFAR datasets, e.g., 3.0% linear accuracy improvement on ImageNet-1K and 8.7% kNN accuracy improvement on CIFAR100. Moreover, our method achieves the leading transfer performance on downstream tasks, object detection and instance segmentation on COCO dataset. The code is available at //github.com/visresearch/patchmix
Image-level weakly supervised semantic segmentation (WSSS) is a fundamental yet challenging computer vision task facilitating scene understanding and automatic driving. Most existing methods resort to classification-based Class Activation Maps (CAMs) to play as the initial pseudo labels, which tend to focus on the discriminative image regions and lack customized characteristics for the segmentation task. To alleviate this issue, we propose a novel activation modulation and recalibration (AMR) scheme, which leverages a spotlight branch and a compensation branch to obtain weighted CAMs that can provide recalibration supervision and task-specific concepts. Specifically, an attention modulation module (AMM) is employed to rearrange the distribution of feature importance from the channel-spatial sequential perspective, which helps to explicitly model channel-wise interdependencies and spatial encodings to adaptively modulate segmentation-oriented activation responses. Furthermore, we introduce a cross pseudo supervision for dual branches, which can be regarded as a semantic similar regularization to mutually refine two branches. Extensive experiments show that AMR establishes a new state-of-the-art performance on the PASCAL VOC 2012 dataset, surpassing not only current methods trained with the image-level of supervision but also some methods relying on stronger supervision, such as saliency label. Experiments also reveal that our scheme is plug-and-play and can be incorporated with other approaches to boost their performance.
Most object recognition approaches predominantly focus on learning discriminative visual patterns while overlooking the holistic object structure. Though important, structure modeling usually requires significant manual annotations and therefore is labor-intensive. In this paper, we propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions into the traditional framework. We show the recognition backbone can be substantially enhanced for more robust representation learning, without any cost of extra annotation and inference speed. Specifically, we first propose an object-extent learning module for localizing the object according to the visual patterns shared among the instances in the same category. We then design a spatial context learning module for modeling the internal structures of the object, through predicting the relative positions within the extent. These two modules can be easily plugged into any backbone networks during training and detached at inference time. Extensive experiments show that our look-into-object approach (LIO) achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft). We also show that this learning paradigm is highly generalizable to other tasks such as object detection and segmentation (MS COCO). Project page: //github.com/JDAI-CV/LIO.
Automatic KB completion for commonsense knowledge graphs (e.g., ATOMIC and ConceptNet) poses unique challenges compared to the much studied conventional knowledge bases (e.g., Freebase). Commonsense knowledge graphs use free-form text to represent nodes, resulting in orders of magnitude more nodes compared to conventional KBs (18x more nodes in ATOMIC compared to Freebase (FB15K-237)). Importantly, this implies significantly sparser graph structures - a major challenge for existing KB completion methods that assume densely connected graphs over a relatively smaller set of nodes. In this paper, we present novel KB completion models that can address these challenges by exploiting the structural and semantic context of nodes. Specifically, we investigate two key ideas: (1) learning from local graph structure, using graph convolutional networks and automatic graph densification and (2) transfer learning from pre-trained language models to knowledge graphs for enhanced contextual representation of knowledge. We describe our method to incorporate information from both these sources in a joint model and provide the first empirical results for KB completion on ATOMIC and evaluation with ranking metrics on ConceptNet. Our results demonstrate the effectiveness of language model representations in boosting link prediction performance and the advantages of learning from local graph structure (+1.5 points in MRR for ConceptNet) when training on subgraphs for computational efficiency. Further analysis on model predictions shines light on the types of commonsense knowledge that language models capture well.