Flexible-position multiple-input multiple-output (FLP-MIMO), which encompasses technologies such as fluid antennas and movable antennas, is promising for future wireless communications. This is because the positions of antennas at the transceiver and reflector can be dynamically optimized to achieve better channel conditions and, as such, can provide high spectral efficiency (SE) and energy efficiency (EE) gains with fewer antennas. In this article, we introduce the fundamentals of FLP-MIMO systems, including hardware design, structure design, and potential applications. We shall demonstrate that FLP-MIMO, using fewer flexible antennas, can match the channel hardening achieved by a large number of fixed antennas. We will then analyze the SE-EE relationship for FLP-MIMO and fixed-position MIMO. Furthermore, we will design the optimal trajectory of flexible antennas to maximize system sum SE or total EE for a fixed travel distance of each antenna. Finally, several important research directions regarding FLP-MIMO communications are presented to facilitate further investigation.
Model compression is a crucial part of deploying neural networks (NNs), especially when the memory and storage of computing devices are limited, as is the case in many applications. This paper focuses on two popular model compression techniques: low-rank approximation and weight pruning. However, training NNs with low-rank approximation and weight pruning always suffers significant accuracy loss and convergence issues. In this paper, a holistic framework is proposed for model compression from a novel perspective of nonconvex optimization by designing an appropriate objective function. Then, we introduce NN-BCD, a block coordinate descent (BCD) algorithm to solve the nonconvex optimization problem. One advantage of our algorithm is that an efficient iteration scheme can be derived in closed form and is gradient-free. Therefore, our algorithm will not suffer from vanishing/exploding gradient problems. Furthermore, with the Kurdyka-Łojasiewicz (KŁ) property of our objective function, we show that our algorithm globally converges to a critical point at the rate of O(1/k), where k denotes the number of iterations. Lastly, extensive experiments with tensor train decomposition and weight pruning demonstrate the efficiency and superior performance of the proposed framework. Our code implementation is available at //github.com/ChenyangLi-97/NN-BCD
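As a rough illustration of the two compression techniques named in the abstract above, the following sketch applies a truncated-SVD low-rank approximation and magnitude-based pruning to a single weight matrix. It is not the NN-BCD algorithm and does not use tensor train decomposition; the function names `low_rank_approx` and `magnitude_prune`, the use of NumPy, and the choice of SVD and magnitude thresholding are all assumptions made for illustration.

```python
import numpy as np


def low_rank_approx(W, rank):
    """Best rank-`rank` approximation of W in Frobenius norm (truncated SVD).

    Hypothetical helper for illustration; not the paper's method.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the `rank` largest singular values and their vectors.
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


def magnitude_prune(W, sparsity):
    """Zero the fraction `sparsity` of entries with the smallest magnitude.

    Entries whose magnitude is at or below the k-th smallest magnitude are
    set to zero, so ties can prune slightly more than k entries.
    """
    k = int(round(sparsity * W.size))
    if k == 0:
        return W.copy()
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    mask = np.abs(W) > threshold
    return W * mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((6, 4))
    W_lr = low_rank_approx(W, rank=2)
    W_sp = magnitude_prune(W, sparsity=0.5)
    print("rank of low-rank approximation:", np.linalg.matrix_rank(W_lr))
    print("nonzeros after pruning:", np.count_nonzero(W_sp))
```

Both helpers compress an already trained matrix after the fact. The abstract instead describes building the compression into the training objective, which this sketch does not attempt.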
Parameter-Efficient Fine-Tuning (PEFT) is increasingly recognized as an effective method in speech processing. However, the optimal approach and the placement of PEFT methods remain inconclusive. Our study conducts extensive experiments to compare different PEFT methods and their layer-wise placement adapting Differentiable Architecture Search (DARTS). We also explore the use of ensemble learning to leverage diverse PEFT strategies. The results reveal that DARTS does not outperform the baseline approach, which involves inserting the same PEFT method into all layers of a Self-Supervised Learning (SSL) model. In contrast, an ensemble learning approach, particularly one employing majority voting, demonstrates superior performance. Our statistical evidence indicates that different PEFT methods learn in varied ways. This variation might explain why the synergistic integration of various PEFT methods through ensemble learning can harness their unique learning capabilities more effectively compared to individual layer-wise optimization.
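The majority-voting ensemble mentioned in the abstract above can be sketched as follows. This is a minimal illustration that assumes each fine-tuned model has already produced one hard class label per example; the function name `majority_vote` and the tie-breaking rule are assumptions, not details taken from the study.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label list by majority vote.

    `predictions` is a list with one entry per model; each entry is a list of
    predicted labels, one per example. If several labels tie for the highest
    count, the one predicted by the earliest-listed model wins (an
    illustrative choice).
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must predict the same number of examples")

    voted = []
    for labels in zip(*predictions):  # labels for one example across models
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


if __name__ == "__main__":
    model_a = ["happy", "sad", "angry"]
    model_b = ["happy", "angry", "angry"]
    model_c = ["sad", "sad", "angry"]
    print(majority_vote([model_a, model_b, model_c]))
```

Voting over hard labels needs no access to model internals, so models fine-tuned with different PEFT methods can be combined without retraining.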
In the rapidly evolving field of artificial intelligence, the creation and utilization of synthetic datasets have become increasingly significant. This report delves into the multifaceted aspects of synthetic data, particularly emphasizing the challenges and potential biases these datasets may harbor. It explores the methodologies behind synthetic data generation, spanning traditional statistical models to advanced deep learning techniques, and examines their applications across diverse domains. The report also critically addresses the ethical considerations and legal implications associated with synthetic datasets, highlighting the urgent need for mechanisms to ensure fairness, mitigate biases, and uphold ethical standards in AI development.
Extremely large-scale multiple-input-multiple-output (XL-MIMO), which offers vast spatial degrees of freedom, has emerged as a potentially pivotal enabling technology for the sixth generation (6G) of wireless mobile networks. With its growing significance, both opportunities and challenges are concurrently manifesting. This paper presents a comprehensive survey of research on XL-MIMO wireless systems. In particular, we introduce four XL-MIMO hardware architectures: uniform linear array (ULA)-based XL-MIMO, uniform planar array (UPA)-based XL-MIMO utilizing either patch antennas or point antennas, and continuous aperture (CAP)-based XL-MIMO. We comprehensively analyze and discuss their characteristics and interrelationships. Following this, we introduce several electromagnetic characteristics and general distance boundaries in XL-MIMO. Given the distinct electromagnetic properties of near-field communications, we present a range of channel models to demonstrate the benefits of XL-MIMO. We further discuss and summarize signal processing schemes for XL-MIMO. It is worth noting that the low-complexity signal processing schemes and deep learning empowered signal processing schemes are reviewed and highlighted to promote the practical implementation of XL-MIMO. Furthermore, we explore the interplay between XL-MIMO and other emergent 6G technologies. Finally, we outline several compelling research directions for future XL-MIMO wireless communication systems.
Monte Carlo integration is fundamental in scientific and statistical computation, but requires reliable samples from the target distribution, which poses a substantial challenge in the case of multi-modal distributions. Existing methods often involve time-consuming tuning, and typically lack tailored estimators for efficient use of the samples. This paper adapts the Warp-U transformation [Wang et al., 2022] to form a multi-modal sampling strategy called Warp-U sampling. It constructs a stochastic map to transport a multi-modal density into a uni-modal one, and subsequently inverts the transport but with new stochasticity injected. For efficient use of the samples for normalising constant estimation, we propose (i) an unbiased estimation scheme based on coupled chains, where the Warp-U sampling is used to reduce the coupling time; and (ii) a stochastic Warp-U bridge sampling estimator, which improves its deterministic counterpart given in Wang et al. [2022]. Our overall approach requires less tuning and is easier to apply than common alternatives. Theoretically, we establish the ergodicity of our sampling algorithm and show that our stochastic Warp-U bridge sampling estimator has greater (asymptotic) precision per CPU second compared to the Warp-U bridge estimator of Wang et al. [2022] under practical conditions. The advantages and current limitations of our approach are demonstrated through simulation studies and an application to exoplanet detection.
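To make the idea of bridge sampling for normalising constant estimation concrete, the sketch below uses the classical geometric bridge estimator on a toy bimodal target. It is not the Warp-U sampler or the stochastic Warp-U bridge estimator of the abstract above. The target `q1`, the reference density `q2`, the constant `SIGMA2`, and the function name `geometric_bridge_estimate` are all assumptions chosen so that the true constant is known and exact samples from the target are available.

```python
import numpy as np

SIGMA2 = 4.0  # standard deviation of the normalised reference density (assumed)


def q1(x):
    """Unnormalised bimodal target: two unit-variance Gaussian bumps at -3, +3."""
    return np.exp(-0.5 * (x + 3.0) ** 2) + np.exp(-0.5 * (x - 3.0) ** 2)


def q2(x):
    """Normalised N(0, SIGMA2^2) density, so its normalising constant is 1."""
    return np.exp(-0.5 * (x / SIGMA2) ** 2) / (SIGMA2 * np.sqrt(2.0 * np.pi))


def geometric_bridge_estimate(n, seed=0):
    """Estimate Z1 = integral of q1 using the geometric bridge identity

        Z1 / Z2 = E_2[sqrt(q1 / q2)] / E_1[sqrt(q2 / q1)],

    where E_i is an expectation under the normalised density q_i / Z_i
    and Z2 = 1 here.
    """
    rng = np.random.default_rng(seed)
    # Exact draws from the target: an equal-weight mixture of N(-3,1), N(3,1).
    centres = rng.choice([-3.0, 3.0], size=n)
    x1 = centres + rng.standard_normal(n)
    # Exact draws from the reference density.
    x2 = SIGMA2 * rng.standard_normal(n)
    numerator = np.mean(np.sqrt(q1(x2) / q2(x2)))
    denominator = np.mean(np.sqrt(q2(x1) / q1(x1)))
    return numerator / denominator


TRUE_Z1 = 2.0 * np.sqrt(2.0 * np.pi)  # each bump integrates to sqrt(2*pi)

if __name__ == "__main__":
    estimate = geometric_bridge_estimate(20_000)
    print(f"bridge estimate: {estimate:.4f}, true value: {TRUE_Z1:.4f}")
```

The toy example sidesteps the hard part, since it samples the bimodal target exactly. In practice those samples come from an MCMC scheme that must visit all modes, which is the difficulty the abstract addresses.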
Multi-modal 3D scene understanding has gained considerable attention due to its wide applications in many areas, such as autonomous driving and human-computer interaction. Compared to conventional single-modal 3D understanding, introducing an additional modality not only elevates the richness and precision of scene interpretation but also ensures a more robust and resilient understanding. This becomes especially crucial in varied and challenging environments where solely relying on 3D data might be inadequate. While there has been a surge in the development of multi-modal 3D methods over the past three years, especially those integrating multi-camera images (3D+2D) and textual descriptions (3D+language), a comprehensive and in-depth review is notably absent. In this article, we present a systematic survey of recent progress to bridge this gap. We begin by briefly introducing a background that formally defines various 3D multi-modal tasks and summarizes their inherent challenges. After that, we present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations. Furthermore, comparative results of recent approaches on several benchmark datasets, together with insightful analysis, are offered. Finally, we discuss the unresolved issues and provide several potential avenues for future research.
With the advances of data-driven machine learning research, a wide variety of prediction problems have been tackled. It has become critical to explore how machine learning and specifically deep learning methods can be exploited to analyse healthcare data. A major limitation of existing methods has been the focus on grid-like data; however, the structure of physiological recordings is often irregular and unordered, which makes it difficult to conceptualise them as a matrix. As such, graph neural networks have attracted significant attention by exploiting implicit information that resides in a biological system, with interactive nodes connected by edges whose weights can be either temporal associations or anatomical junctions. In this survey, we thoroughly review the different types of graph architectures and their applications in healthcare. We provide an overview of these methods in a systematic manner, organized by their domain of application including functional connectivity, anatomical structure and electrical-based analysis. We also outline the limitations of existing techniques and discuss potential directions for future research.
Most existing event extraction (EE) methods merely extract event arguments within the sentence scope. However, such sentence-level EE methods struggle to handle the soaring number of documents from emerging applications, such as finance, legislation, and health, where event arguments always scatter across different sentences, and multiple such event mentions frequently co-exist in the same document. To address these challenges, we propose a novel end-to-end model, Doc2EDAG, which can generate an entity-based directed acyclic graph to fulfill the document-level EE (DEE) effectively. Moreover, we reformalize a DEE task with the no-trigger-words design to ease the document-level event labeling. To demonstrate the effectiveness of Doc2EDAG, we build a large-scale real-world dataset consisting of Chinese financial announcements with the challenges mentioned above. Extensive experiments with comprehensive analyses illustrate the superiority of Doc2EDAG over state-of-the-art methods. Data and code can be found at //github.com/dolphin-zs/Doc2EDAG.
The problem of Multiple Object Tracking (MOT) consists in following the trajectory of different objects in a sequence, usually a video. In recent years, with the rise of Deep Learning, the algorithms that provide a solution to this problem have benefited from the representational power of deep models. This paper provides a comprehensive survey on works that employ Deep Learning models to solve the task of MOT on single-camera videos. Four main steps in MOT algorithms are identified, and an in-depth review of how Deep Learning was employed in each one of these stages is presented. A complete experimental comparison of the presented works on the three MOTChallenge datasets is also provided, identifying a number of similarities among the top-performing methods and presenting some possible future research directions.
Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question and answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the computational models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We have conducted a user study to validate the quality of explanations synthesized by our method. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.