Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts in source and target domains remain source-dependent as they require access to the source data during the adaptation stage. In this paper, we present a self-training based source-free video domain adaptation approach to address this challenge by bridging the gap between the source and the target domains. We use the source pre-trained model to generate pseudo-labels for the target domain samples, which are inevitably noisy. Thus, we treat the problem of source-free video domain adaptation as learning from noisy labels and argue that the samples with correct pseudo-labels can help us in adaptation. To this end, we leverage the cross-entropy loss as an indicator of the correctness of the pseudo-labels and use the resulting small-loss samples from the target domain for fine-tuning the model. We further enhance the adaptation performance by implementing a teacher-student framework, in which the teacher, which is updated gradually, produces reliable pseudo-labels. Meanwhile, the student undergoes fine-tuning on the target domain videos using these generated pseudo-labels to improve its performance. Extensive experimental evaluations show that our methods, termed as CleanAdapt, CleanAdapt + TS, achieve state-of-the-art results, outperforming the existing approaches on various open datasets. Our source code is publicly available at //avijit9.github.io/CleanAdapt.
We investigate the application of the factor graph framework for blind joint channel estimation and symbol detection on time-variant linear inter-symbol interference channels. In particular, we consider the expectation maximization (EM) algorithm for maximum likelihood estimation, which typically suffers from high complexity as it requires the computation of the symbol-wise posterior distributions in every iteration. We address this issue by efficiently approximating the posteriors using the belief propagation (BP) algorithm on a suitable factor graph. By interweaving the iterations of BP and EM, the detection complexity can be further reduced to a single BP iteration per EM step. In addition, we propose a data-driven version of our algorithm that introduces momentum in the BP updates and learns a suitable EM parameter update schedule, thereby significantly improving the performance-complexity tradeoff with a few offline training samples. Our numerical experiments demonstrate the excellent performance of the proposed blind detector and show that it even outperforms coherent BP detection in high signal-to-noise scenarios.
Many industrial tasks-such as sanding, installing fasteners, and wire harnessing-are difficult to automate due to task complexity and variability. We instead investigate deploying robots in an assistive role for these tasks, where the robot assumes the physical task burden and the skilled worker provides both the high-level task planning and low-level feedback necessary to effectively complete the task. In this article, we describe the development of a system for flexible human-robot teaming that combines state-of-the-art methods in end-user programming and shared autonomy and its implementation in sanding applications. We demonstrate the use of the system in two types of sanding tasks, situated in aircraft manufacturing, that highlight two potential workflows within the human-robot teaming setup. We conclude by discussing challenges and opportunities in human-robot teaming identified during the development, application, and demonstration of our system.
The pre-trained text-image discriminative models, such as CLIP, has been explored for open-vocabulary semantic segmentation with unsatisfactory results due to the loss of crucial localization information and awareness of object shapes. Recently, there has been a growing interest in expanding the application of generative models from generation tasks to semantic segmentation. These approaches utilize generative models either for generating annotated data or extracting features to facilitate semantic segmentation. This typically involves generating a considerable amount of synthetic data or requiring additional mask annotations. To this end, we uncover the potential of generative text-to-image diffusion models (e.g., Stable Diffusion) as highly efficient open-vocabulary semantic segmenters, and introduce a novel training-free approach named DiffSegmenter. The insight is that to generate realistic objects that are semantically faithful to the input text, both the complete object shapes and the corresponding semantics are implicitly learned by diffusion models. We discover that the object shapes are characterized by the self-attention maps while the semantics are indicated through the cross-attention maps produced by the denoising U-Net, forming the basis of our segmentation results.Additionally, we carefully design effective textual prompts and a category filtering mechanism to further enhance the segmentation results. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
The manipulation of deformable objects by robotic systems presents a significant challenge due to their complex and infinite-dimensional configuration spaces. This paper introduces a novel approach to Deformable Object Manipulation (DOM) by emphasizing the identification and manipulation of Structures of Interest (SOIs) in deformable fabric bags. We propose a bimanual manipulation framework that leverages a Graph Neural Network (GNN)-based latent dynamics model to succinctly represent and predict the behavior of these SOIs. Our approach involves constructing a graph representation from partial point cloud data of the object and learning the latent dynamics model that effectively captures the essential deformations of the fabric bag within a reduced computational space. By integrating this latent dynamics model with Model Predictive Control (MPC), we empower robotic manipulators to perform precise and stable manipulation tasks focused on the SOIs. We have validated our framework through various empirical experiments demonstrating its efficacy in bimanual manipulation of fabric bags. Our contributions not only address the complexities inherent in DOM but also provide new perspectives and methodologies for enhancing robotic interactions with deformable objects by concentrating on their critical structural elements. Experimental videos can be obtained from //sites.google.com/view/bagbot.
Integrating different functionalities, conventionally implemented as dedicated systems, into a single platform allows utilising the available resources more efficiently. We consider an integrated sensing and power transfer (ISAPT) system and propose the joint optimisation of the rectangular pulse-shaped transmit signal and the beamforming vector to combine sensing and wireless power transfer (WPT) functionalities efficiently. In contrast to prior works, we adopt an accurate non-linear circuit-based energy harvesting (EH) model. We formulate and solve a non-convex optimisation problem for a general number of EH receivers to maximise a weighted sum of the average harvested powers at the EH receivers while ensuring the received echo signal reflected by a sensing target (ST) has sufficient power for estimating the range to the ST with a prescribed accuracy within the considered coverage region. The average harvested power is shown to monotonically increase with the pulse duration when the average transmit power budget is sufficiently large. We discuss the trade-off between sensing performance and power transfer for the considered ISAPT system. The proposed approach significantly outperforms a heuristic baseline scheme based on a linear EH model, which linearly combines energy beamforming with the beamsteering vector in the direction to the ST as its transmit strategy.
Recent advancements in deep learning-based image compression are notable. However, prevalent schemes that employ a serial context-adaptive entropy model to enhance rate-distortion (R-D) performance are markedly slow. Furthermore, the complexities of the encoding and decoding networks are substantially high, rendering them unsuitable for some practical applications. In this paper, we propose two techniques to balance the trade-off between complexity and performance. First, we introduce two branching coding networks to independently learn a low-resolution latent representation and a high-resolution latent representation of the input image, discriminatively representing the global and local information therein. Second, we utilize the high-resolution latent representation as conditional information for the low-resolution latent representation, furnishing it with global information, thus aiding in the reduction of redundancy between low-resolution information. We do not utilize any serial entropy models. Instead, we employ a parallel channel-wise auto-regressive entropy model for encoding and decoding low-resolution and high-resolution latent representations. Experiments demonstrate that our method is approximately twice as fast in both encoding and decoding compared to the parallelizable checkerboard context model, and it also achieves a 1.2% improvement in R-D performance compared to state-of-the-art learned image compression schemes. Our method also outperforms classical image codecs including H.266/VVC-intra (4:4:4) and some recent learned methods in rate-distortion performance, as validated by both PSNR and MS-SSIM metrics on the Kodak dataset.
Pseudorange errors are the root cause of localization inaccuracy in GPS. Previous data-driven methods regress and eliminate pseudorange errors using handcrafted intermediate labels. Unlike them, we propose an end-to-end GPS localization framework, E2E-PrNet, to train a neural network for pseudorange correction (PrNet) directly using the final task loss calculated with the ground truth of GPS receiver states. The gradients of the loss with respect to learnable parameters are backpropagated through a differentiable nonlinear least squares optimizer to PrNet. The feasibility is verified with GPS data collected by Android phones, showing that E2E-PrNet outperforms the state-of-the-art end-to-end GPS localization methods.
Recent advances in maximizing mutual information (MI) between the source and target have demonstrated its effectiveness in text generation. However, previous works paid little attention to modeling the backward network of MI (i.e., dependency from the target to the source), which is crucial to the tightness of the variational information maximization lower bound. In this paper, we propose Adversarial Mutual Information (AMI): a text generation framework which is formed as a novel saddle point (min-max) optimization aiming to identify joint interactions between the source and target. Within this framework, the forward and backward networks are able to iteratively promote or demote each other's generated instances by comparing the real and synthetic data distributions. We also develop a latent noise sampling strategy that leverages random variations at the high-level semantic space to enhance the long term dependency in the generation process. Extensive experiments based on different text generation tasks demonstrate that the proposed AMI framework can significantly outperform several strong baselines, and we also show that AMI has potential to lead to a tighter lower bound of maximum mutual information for the variational information maximization problem.
Translational distance-based knowledge graph embedding has shown progressive improvements on the link prediction task, from TransE to the latest state-of-the-art RotatE. However, N-1, 1-N and N-N predictions still remain challenging. In this work, we propose a novel translational distance-based approach for knowledge graph link prediction. The proposed method includes two-folds, first we extend the RotatE from 2D complex domain to high dimension space with orthogonal transforms to model relations for better modeling capacity. Second, the graph context is explicitly modeled via two directed context representations. These context representations are used as part of the distance scoring function to measure the plausibility of the triples during training and inference. The proposed approach effectively improves prediction accuracy on the difficult N-1, 1-N and N-N cases for knowledge graph link prediction task. The experimental results show that it achieves better performance on two benchmark data sets compared to the baseline RotatE, especially on data set (FB15k-237) with many high in-degree connection nodes.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.