Ensuring extremely high reliability is essential for channel coding in 6G networks. The next-generation ultra-reliable and low-latency communications (xURLLC) scenario in 6G networks requires a frame error rate (FER) below $10^{-9}$. However, low-density parity-check (LDPC) codes, the standard in 5G new radio (NR), suffer from the error floor phenomenon, which makes such low rates hard to achieve. To tackle this problem, we introduce the boosted neural min-sum (NMS) decoder. This decoder operates identically to conventional NMS decoders, but is trained with novel methods: i) boosting learning with uncorrected vectors, ii) a block-wise training schedule to address the vanishing gradient issue, iii) dynamic weight sharing to minimize the number of trainable weights, iv) transfer learning to reduce the required sample count, and v) data augmentation to expedite the sampling process. Leveraging these training strategies, the boosted NMS decoder achieves state-of-the-art performance in reducing the error floor as well as superior waterfall performance. Remarkably, we fulfill the 6G xURLLC requirement for 5G LDPC codes without a severe error floor. Additionally, once its weights are trained, the boosted NMS decoder can perform decoding without additional modules, making it highly practical for immediate application.
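For readers unfamiliar with min-sum decoding, the check-node update that a weighted (neural) min-sum decoder applies can be sketched as follows. This is a minimal illustration, not the decoder from the abstract: the function name `check_node_update` and the use of a single scalar `weight` per check node are assumptions made here for brevity, whereas a trained NMS decoder would typically learn such weights per iteration or per edge.

```python
def check_node_update(incoming, weight=1.0):
    """Weighted min-sum update for one check node (illustrative sketch).

    incoming: variable-to-check LLR messages, one per connected edge
              (at least two edges are assumed).
    weight:   hypothetical scalar standing in for a trained NMS weight.
    Returns the check-to-variable message for every edge; each message
    uses only the *other* edges (extrinsic information).
    """
    outgoing = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        # The sign is the product of the signs of the other messages.
        sign = 1.0
        for value in others:
            if value < 0:
                sign = -sign
        # The magnitude is the smallest absolute value among the others.
        magnitude = min(abs(value) for value in others)
        outgoing.append(weight * sign * magnitude)
    return outgoing


messages = check_node_update([2.0, -0.5, 1.5], weight=0.8)
```

Setting `weight=1.0` recovers the plain min-sum rule; values below 1 attenuate the messages, which is the role the trainable weights play in this sketch.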
Deep learning models are widely used for speaker recognition and spoofing speech detection. We propose GMM-ResNet2 for synthetic speech detection. Compared with the previous GMM-ResNet model, GMM-ResNet2 has four improvements. Firstly, because GMMs of different orders differ in their ability to form smooth approximations to the feature distribution, multiple GMMs are used to extract multi-scale Log Gaussian Probability features. Secondly, a grouping technique is used to improve the classification accuracy by exposing the group cardinality while reducing both the number of parameters and the training time. The final score is obtained by averaging the outputs of all group classifiers. Thirdly, the residual block is improved by including one activation function and one batch normalization layer. Finally, an ensemble-aware loss function is proposed to integrate the independent loss functions of all ensemble members. On the ASVspoof 2019 LA task, GMM-ResNet2 achieves a minimum t-DCF of 0.0227 and an EER of 0.79\%. On the ASVspoof 2021 LA task, GMM-ResNet2 achieves a minimum t-DCF of 0.2362 and an EER of 2.19\%, representing relative reductions of 31.4\% and 76.3\% compared with the LFCC-LCNN baseline.
With advancements in hardware, high-quality head-mounted display (HMD) devices are being developed by numerous companies, driving increased consumer interest in AR, VR, and MR applications. In this work, we present a new dataset, called VRBiom, of periocular videos acquired using a Virtual Reality headset. VRBiom, targeted at biometric applications, consists of 900 short videos acquired from 25 individuals and recorded in the NIR spectrum. These 10-second videos were captured using the internal tracking cameras of the Meta Quest Pro at 72 FPS. To encompass real-world variations, the dataset includes recordings under three gaze conditions: steady, moving, and partially closed eyes. We have also ensured an equal split of recordings with and without glasses to facilitate the analysis of eyewear. These videos, characterized by non-frontal views of the eye and relatively low spatial resolution (400 x 400), can be instrumental in advancing state-of-the-art research across various biometric applications. The VRBiom dataset can be used to evaluate, train, or adapt models for biometric use cases such as iris and/or periocular recognition and associated sub-tasks such as detection and semantic segmentation. In addition to data from real individuals, we have included around 1100 presentation attacks (PAs) constructed from 92 PA instruments (PAIs). These PAIs fall into six categories constructed through combinations of print attacks (real and synthetic identities), fake 3D eyeballs, plastic eyes, and various types of masks and mannequins. These PA videos, combined with genuine (bona-fide) data, can be used to address concerns related to spoofing, which is a significant threat if these devices are to be used for authentication. The VRBiom dataset is publicly available for research purposes related to biometric applications only.
Applying deep neural networks to 3D point cloud processing has attracted increasing attention due to its strong performance in many areas, such as AR/VR, autonomous driving, and robotics. However, as neural network models and 3D point clouds grow in size, reducing the computational and memory overhead to meet latency and energy constraints in real-world applications becomes a crucial challenge. Although existing approaches reduce both computational cost and memory footprint, most of them address only the spatial redundancy in inputs, i.e., they remove redundant background points in 3D data. In this paper, we propose a novel post-training weight pruning scheme for 3D object detection that is (1) orthogonal to all existing point cloud sparsifying methods, identifying redundant parameters in the pretrained model whose removal leads to minimal distortion in both locality and confidence (detection distortion); and (2) a universal plug-and-play pruning framework that works with arbitrary 3D detection models. This framework aims to minimize the detection distortion of the network output, and thereby maximally maintain detection precision, by identifying layer-wise sparsity based on a second-order Taylor approximation of the distortion. Although the framework uses second-order information, we introduce a lightweight scheme to acquire Hessian information efficiently, and subsequently perform dynamic programming to solve for the layer-wise sparsity. Extensive experiments on the KITTI, nuScenes, and ONCE datasets demonstrate that our approach maintains and even boosts the detection precision of pruned models under a noticeable reduction in computation (FLOPs). Notably, we achieve FLOPs reductions of over 3.89x and 3.72x on the CenterPoint and PVRCNN models, respectively, without a decrease in mAP, significantly improving on the state of the art.
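As a reminder of what a second-order Taylor approximation of a distortion looks like, the generic textbook form is given below. The symbols are assumptions introduced here, not notation from the abstract: $D$ denotes the detection distortion as a function of the weights $w$, $\delta w$ is the perturbation caused by pruning, and $g$ and $H$ are the gradient and Hessian of $D$ at the pretrained weights. The exact objective and the way the Hessian is estimated in the pruning scheme above may differ.

```latex
\[
  D(w + \delta w) - D(w) \;\approx\; g^{\top} \delta w
  + \tfrac{1}{2}\, \delta w^{\top} H \, \delta w,
  \qquad
  g = \nabla_{w} D(w), \quad H = \nabla_{w}^{2} D(w).
\]
```

Under this form, the quadratic term is what second-order (Hessian) information contributes when comparing how much distortion different layer-wise sparsity levels would introduce.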
Large-scale video conferencing services incur significant network cost while serving surging global demand. Our work systematically explores the opportunity to offload a fraction of this traffic from the private WAN to the Internet, a cheaper routing option already offered by cloud providers, without a drop in application performance. First, with a large-scale latency measurement study with 3.5 million data points per day spanning 241K source cities and 21 data centers across the globe, we demonstrate that Internet paths perform comparably to or better than the private WAN in parts of the world (e.g., Europe and North America). Next, we present Titan, a live (12+ months) production system that carefully moves a fraction of the conferencing traffic to the Internet based on this observation. Finally, we propose Titan-Next, a research prototype that jointly assigns the conferencing server and routing option (Internet or WAN) for individual calls. With 5 weeks of production data, we show that Titan-Next reduces the sum of peak bandwidth on WAN links, which defines the operational network cost, by up to 61% compared to state-of-the-art baselines. We will open-source parts of the measurement data.
Diffusion models have demonstrated powerful data generation capabilities in various research fields such as image generation. However, the criteria for evaluating the quality of generated vibration signals differ fundamentally from those used in image generation. At present, there is no research on the ability of diffusion models to generate vibration signals. In this paper, a Time Series Diffusion Method (TSDM) is proposed for vibration signal generation, leveraging the foundational principles of diffusion models. The TSDM uses an improved U-net architecture with attention blocks, ResBlocks, and TimeEmbedding to effectively segment and extract features from one-dimensional time series data. It operates based on forward diffusion and reverse denoising processes for time-series generation. Experimental validation is conducted using single-frequency, multi-frequency, and bearing fault datasets. The results show that TSDM can accurately generate the single-frequency and multi-frequency features in the time series and retain the basic frequency features in the generated bearing fault series. It is also found that the original DDPM cannot generate high-quality vibration signals, whereas the improved U-net in TSDM, which combines attention blocks and ResBlocks, effectively improves the quality of vibration signal generation. Finally, TSDM is applied to small-sample fault diagnosis on three public bearing fault datasets, and the results show that the diagnosis accuracy on the three datasets improves by up to 32.380%, 18.355%, and 9.298%, respectively.
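The forward diffusion process mentioned above can be illustrated on a one-dimensional signal with the standard DDPM closed form, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The sketch below is a generic illustration under stated assumptions, not the TSDM implementation: the linear noise schedule, its endpoints, the 1000-step horizon, the 5 Hz test signal, and all function names are choices made here for the example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, as in the original DDPM)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """abar_t = product over s <= t of (1 - beta_s)."""
    result, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        result.append(product)
    return result


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 using the closed-form forward process."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


# Hypothetical single-frequency signal: 5 Hz sine sampled at 256 Hz for one second.
signal = [math.sin(2.0 * math.pi * 5.0 * n / 256.0) for n in range(256)]
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
slightly_noisy = forward_diffuse(signal, 10, alpha_bars, rng)
almost_pure_noise = forward_diffuse(signal, 999, alpha_bars, rng)
```

At small `t` the sample stays close to the original sine, while at the last step almost none of the signal remains; the reverse denoising network is what has to undo this corruption step by step.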
Social media popularity (SMP) prediction is a complex task involving multi-modal data integration. While pre-trained vision-language models (VLMs) like CLIP have been widely adopted for this task, their effectiveness in capturing the unique characteristics of social media content remains unexplored. This paper critically examines the applicability of CLIP-based features in SMP prediction, focusing on the overlooked phenomenon of semantic inconsistency between images and text in social media posts. Through extensive analysis, we demonstrate that this inconsistency increases with post popularity, challenging the conventional use of VLM features. We provide a comprehensive investigation of semantic inconsistency across different popularity intervals and analyze the impact of VLM feature adaptation on SMP tasks. Our experiments reveal that incorporating inconsistency measures and adapted text features significantly improves model performance, achieving an SRC of 0.729 and an MAE of 1.227. These findings not only enhance SMP prediction accuracy but also provide crucial insights for developing more targeted approaches in social media analysis.
Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is verifiable by examining only the final instruction. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rule following), each assessing different aspects of sequential instruction following. Our evaluation of popular LLMs, both closed-source and open-source, shows that more recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness. All models struggle with following sequences of instructions, hinting at an important lack of robustness of today's language models.
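The idea that a whole chain of instructions can be verified from the final result alone can be made concrete with a toy text-modification chain. This is only an illustration of the principle, assuming each instruction is a simple "replace A with B" step; the helper names and the example sentence are hypothetical and are not taken from the SIFo benchmark.

```python
def apply_instructions(text, instructions):
    """Apply each (old, new) replacement in order; later steps build on earlier ones."""
    for old, new in instructions:
        text = text.replace(old, new)
    return text


def is_correct(model_output, text, instructions):
    """Check only the final state; it matches only if every step was followed."""
    return model_output == apply_instructions(text, instructions)


source = "the cat sat"
steps = [("cat", "dog"), ("dog", "fox")]  # step 2 depends on the result of step 1

followed_all = "the fox sat"
skipped_first = apply_instructions(source, steps[1:])  # step 1 was ignored
```

Because the second instruction acts on the output of the first, a model that skips the first step ends with "the cat sat" and fails the single final check.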
Similar to vision-and-language navigation (VLN) tasks that focus on bridging the gap between vision and language for embodied navigation, the new Rendezvous (RVS) task requires reasoning over allocentric spatial relationships (independent of the observer's viewpoint) using non-sequential navigation instructions and maps. However, performance drops substantially in new environments with no training data. Using open-source descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text, resulting in low geolocation resolution. We propose a large-scale augmentation method for generating high-quality synthetic data for new environments using readily available geospatial data. Our method constructs a grounded knowledge graph that captures entity relationships. Sampled entities and relations ('shop north of school') are turned into navigation instructions by (i) generating numerous templates using a context-free grammar (CFG) to embed specific entities and relations, and (ii) feeding the entities and relations into a large language model (LLM) for instruction generation. A comprehensive evaluation on RVS shows that our approach improves the 100-meter accuracy by 45.83% on unseen environments. Furthermore, we demonstrate that models trained with CFG-based augmentation achieve superior performance compared with those trained with LLM-based augmentation, in both unseen and seen environments. These findings suggest that explicitly structuring spatial information for text-based geospatial reasoning in previously unknown environments can unlock data-scarce scenarios.
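To make the CFG-based template step more tangible, the sketch below expands a tiny context-free grammar into instruction templates and then fills them with one sampled relation. The grammar, its symbol names, and the slot names (`goal`, `direction`, `landmark`) are invented for this example and are far smaller than anything a real augmentation pipeline would use.

```python
import itertools

# Hypothetical toy grammar: keys are non-terminals, every other string is a terminal.
GRAMMAR = {
    "S": [["VERB", "the", "{goal}", "REL"]],
    "VERB": [["Go to"], ["Meet at"]],
    "REL": [["{direction}", "of the", "{landmark}"]],
}


def expand(symbol, grammar):
    """Return every string derivable from `symbol` (the grammar must be non-recursive)."""
    if symbol not in grammar:
        return [symbol]  # terminal: emitted as-is
    results = []
    for production in grammar[symbol]:
        parts = [expand(child, grammar) for child in production]
        for combination in itertools.product(*parts):
            results.append(" ".join(combination))
    return results


templates = expand("S", GRAMMAR)

# Fill the templates with one sampled relation, e.g. 'shop north of school'.
instructions = [
    template.format(goal="shop", direction="north", landmark="school")
    for template in templates
]
```

Each additional production multiplies the number of templates, which is how a small grammar can yield many distinct phrasings for the same entity relation.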
Graph neural networks (GNNs) are effective machine learning models for many graph-related applications. Despite their empirical success, many research efforts focus on the theoretical limitations of GNNs, i.e., their expressive power. Early works in this domain mainly study the graph isomorphism recognition ability of GNNs, while recent works characterize the expressive power of GNNs through properties such as subgraph counting and connectivity learning, which are more practical and closer to real-world settings. However, no survey papers or open-source repositories comprehensively summarize and discuss models in this important direction. To fill this gap, we conduct the first survey of models for enhancing expressive power under different forms of definition. Concretely, the models are reviewed in three categories: graph feature enhancement, graph topology enhancement, and GNN architecture enhancement.
Graph neural networks (GNNs) have demonstrated a significant boost in prediction performance on graph data. At the same time, the predictions made by these models are often hard to interpret. Many efforts have therefore been made to explain the prediction mechanisms of these models, with methods such as GNNExplainer, XGNN, and PGExplainer. Although such works present systematic frameworks to interpret GNNs, a holistic review of explainable GNNs is unavailable. In this survey, we present a comprehensive review of explainability techniques developed for GNNs. We focus on explainable graph neural networks and categorize them based on the explanation methods they use. We further provide the common performance metrics for GNN explanations and point out several future research directions.