This study introduces an online target sound extraction (TSE) process using the similarity-and-independence-aware beamformer (SIBF) derived from an iterative batch algorithm. The study aimed to reduce latency while maintaining extraction accuracy. The SIBF, which is a linear method, provides more accurate estimates of the target than an approximate magnitude spectrogram reference. The transition to an online algorithm reduces latency but presents challenges. First, contrary to the conventional assumption, deriving the online algorithm may degrade accuracy as compared to the batch algorithm using a sliding window. Second, conventional post-processing methods intended for scaling the estimated target may widen the accuracy gap between the two algorithms. This study adopts an approach that addresses these challenges and minimizes the accuracy gap during post-processing. It proposes a novel scaling method based on the single-channel Wiener filter (SWF-based scaling). To further improve accuracy, the study introduces a modified version of the time-frequency-varying variance generalized Gaussian distribution as a source model to represent the joint probability between the target and reference. Experimental results using the CHiME-3 dataset demonstrate several key findings: 1) SWF-based scaling effectively eliminates the gap between the two algorithms and improves accuracy. 2) The new source model achieves optimal accuracy, corresponding to the Laplacian model. 3) Our online SIBF outperforms conventional linear TSE methods, including independent vector extraction and minimum mean square error beamforming. These findings can contribute to the fields of beamforming and blind source separation.
Rate splitting multiple access (RSMA) is regarded as a crucial and powerful physical layer (PHY) paradigm for next-generation communication systems. Particularly, users employ successive interference cancellation (SIC) to decode part of the interference while treating the remainder as noise. However, conventional RSMA systems rely on fixed-position antenna arrays, limiting their ability to fully exploit spatial diversity. This constraint reduces beamforming gain and significantly impairs RSMA performance. To address this problem, we propose a movable antenna (MA)-aided RSMA scheme that allows the antennas at the base station (BS) to dynamically adjust their positions. Our objective is to maximize the system sum rate of common and private messages by jointly optimizing the MA positions, beamforming matrix, and common rate allocation. To tackle the formulated non-convex problem, we apply fractional programming (FP) and develop an efficient two-stage, coarse-to-fine-grained searching (CFGS) algorithm to obtain high-quality solutions. Numerical results demonstrate that, with optimized antenna adjustments, the MA-enabled system achieves substantial performance and reliability improvements in RSMA over fixed-position antenna setups.
We introduce a novel Graph Attention Autoencoder (GAE) with spatial regularization to address the challenge of scalable anomaly detection in spatiotemporal rainfall data across India from 1990 to 2015. Our model leverages a Graph Attention Network (GAT) to capture spatial dependencies and temporal dynamics in the data, further enhanced by a spatial regularization term ensuring geographic coherence. We construct two graph datasets employing rainfall, pressure, and temperature attributes from the Indian Meteorological Department and ERA5 Reanalysis on Single Levels, respectively. Our network operates on graph representations of the data, where nodes represent geographic locations, and edges, inferred through event synchronization, denote significant co-occurrences of rainfall events. Through extensive experiments, we demonstrate that our GAE effectively identifies anomalous rainfall patterns across the Indian landscape. Our work paves the way for sophisticated spatiotemporal anomaly detection methodologies in climate science, contributing to better climate change preparedness and response strategies.
Due to the advantages of high computational efficiency and small memory requirements, filter-based visual inertial odometry (VIO) has a good application prospect in miniaturized and payload-constrained embedded systems. However, the filter-based method has the problem of insufficient accuracy. To this end, we propose the State transformation and Pose-only VIO (SP-VIO) by rebuilding the state and measurement models, and considering further visual deprived conditions. In detail, we first proposed a system model based on the double state transformation extended Kalman filter (DST-EKF), which has been proven to have better observability and consistency than the models based on extended Kalman filter (EKF) and state transformation extended Kalman filter (ST-EKF). Secondly, to reduce the influence of linearization error caused by inaccurate 3D reconstruction, we adopt the Pose-only (PO) theory to decouple the measurement model from 3D features. Moreover, to deal with visual deprived conditions, we propose a double state transformation Rauch-Tung-Striebel (DST-RTS) backtracking method to optimize motion trajectories during visual interruption. Experiments on public (EuRoC, Tum-VI, KITTI) and personal datasets show that SP-VIO has better accuracy and efficiency than state-of-the-art (SOTA) VIO algorithms, and has better robustness under visual deprived conditions.
In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.
State-of-the-art performance in electroencephalography (EEG) decoding tasks is currently often achieved with either Deep-Learning (DL) or Riemannian-Geometry-based decoders (RBDs). Recently, there is growing interest in Deep Riemannian Networks (DRNs) possibly combining the advantages of both previous classes of methods. However, there are still a range of topics where additional insight is needed to pave the way for a more widespread application of DRNs in EEG. These include architecture design questions such as network size and end-to-end ability. How these factors affect model performance has not been explored. Additionally, it is not clear how the data within these networks is transformed, and whether this would correlate with traditional EEG decoding. Our study aims to lay the groundwork in the area of these topics through the analysis of DRNs for EEG with a wide range of hyperparameters. Networks were tested on five public EEG datasets and compared with state-of-the-art ConvNets. Here we propose EE(G)-SPDNet, and we show that this wide, end-to-end DRN can outperform the ConvNets, and in doing so use physiologically plausible frequency regions. We also show that the end-to-end approach learns more complex filters than traditional band-pass filters targeting the classical alpha, beta, and gamma frequency bands of the EEG, and that performance can benefit from channel specific filtering approaches. Additionally, architectural analysis revealed areas for further improvement due to the possible under utilisation of Riemannian specific information throughout the network. Our study thus shows how to design and train DRNs to infer task-related information from the raw EEG without the need of handcrafted filterbanks and highlights the potential of end-to-end DRNs such as EE(G)-SPDNet for high-performance EEG decoding.
We present Attend-Fusion, a novel and efficient approach for audio-visual fusion in video classification tasks. Our method addresses the challenge of exploiting both audio and visual modalities while maintaining a compact model architecture. Through extensive experiments on the YouTube-8M dataset, we demonstrate that our Attend-Fusion achieves competitive performance with significantly reduced model complexity compared to larger baseline models.
Object-centric learning (OCL) aims to learn representations of individual objects within visual scenes without manual supervision, facilitating efficient and effective visual reasoning. Traditional OCL methods primarily employ bottom-up approaches that aggregate homogeneous visual features to represent objects. However, in complex visual environments, these methods often fall short due to the heterogeneous nature of visual features within an object. To address this, we propose a novel OCL framework incorporating a top-down pathway. This pathway first bootstraps the semantics of individual objects and then modulates the model to prioritize features relevant to these semantics. By dynamically modulating the model based on its own output, our top-down pathway enhances the representational quality of objects. Our framework achieves state-of-the-art performance across multiple synthetic and real-world object-discovery benchmarks.
This study introduces a novel haptic device for enhancing dexterous manipulation in virtual reality. By stimulating mechanoreceptors on both sides of the fingernail, our lightweight system simulates tangential force sensations. We employ mechanical stimulation for more natural tactile feedback. A preliminary "balancing grasp challenge" experiment shows that users make more frequent micro-adjustments with our device, indicating improved precision. This research aims to advance haptic feedback in VR, potentially leading to more immersive and realistic virtual interactions.
Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to its capability in maximizing the mutual information (MI) between an image and its matched text. However, simply performing cross-modal alignment (CMA) ignores data potential within each modality, which may result in degraded representations. For instance, although CMA-based models are able to map image-text pairs close together in the embedding space, they fail to ensure that similar inputs from the same modality stay close by. This problem can get even worse when the pre-training data is noisy. In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision. Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from image and text input, TCL further maximizes the average MI between local regions of image/text and their global summary. To the best of our knowledge, ours is the first work that takes into account local structure information for multi-modality representation learning. Experimental evaluations show that our approach is competitive and achieve the new state of the art on various common down-stream vision-language tasks such as image-text retrieval and visual question answering.
Unsupervised domain adaptation (UDA) methods for person re-identification (re-ID) aim at transferring re-ID knowledge from labeled source data to unlabeled target data. Although achieving great success, most of them only use limited data from a single-source domain for model pre-training, making the rich labeled data insufficiently exploited. To make full use of the valuable labeled data, we introduce the multi-source concept into UDA person re-ID field, where multiple source datasets are used during training. However, because of domain gaps, simply combining different datasets only brings limited improvement. In this paper, we try to address this problem from two perspectives, \ie{} domain-specific view and domain-fusion view. Two constructive modules are proposed, and they are compatible with each other. First, a rectification domain-specific batch normalization (RDSBN) module is explored to simultaneously reduce domain-specific characteristics and increase the distinctiveness of person features. Second, a graph convolutional network (GCN) based multi-domain information fusion (MDIF) module is developed, which minimizes domain distances by fusing features of different domains. The proposed method outperforms state-of-the-art UDA person re-ID methods by a large margin, and even achieves comparable performance to the supervised approaches without any post-processing techniques.