Acoustic spatial capture-recapture (ASCR) surveys with an array of synchronized acoustic detectors can be an effective way of estimating animal density or call density. However, constructing the capture histories required for ASCR analysis is challenging, as recognizing which detections at different detectors are of which calls is not a trivial task. Because calls from different distances take different times to arrive at detectors, the order in which calls are detected is not necessarily the same as the order in which they are made, and without knowing which detections are of the same call, we do not know how many different calls are detected. We propose a Monte Carlo expectation-maximization (MCEM) estimation method to resolve this unknown call identity problem. To implement the MCEM method in this context, we sample the latent variables from a complete-data likelihood model in the expectation step and use a semi-complete-data likelihood or conditional likelihood in the maximization step. We use a parametric bootstrap to obtain confidence intervals. When we apply our method to a survey of moss frogs, it gives an estimate within 15% of the estimate obtained using data with call capture histories constructed by experts, and unlike this latter estimate, our confidence interval incorporates the uncertainty about call identities. Simulations show it to have a low bias (6%) and coverage probabilities close to the nominal 95% value.
Rate splitting multiple access (RSMA) is regarded as a crucial and powerful physical layer (PHY) paradigm for next-generation communication systems. Particularly, users employ successive interference cancellation (SIC) to decode part of the interference while treating the remainder as noise. However, conventional RSMA systems rely on fixed-position antenna arrays, limiting their ability to fully exploit spatial diversity. This constraint reduces beamforming gain and significantly impairs RSMA performance. To address this problem, we propose a movable antenna (MA)-aided RSMA scheme that allows the antennas at the base station (BS) to dynamically adjust their positions. Our objective is to maximize the system sum rate of common and private messages by jointly optimizing the MA positions, beamforming matrix, and common rate allocation. To tackle the formulated non-convex problem, we apply fractional programming (FP) and develop an efficient two-stage, coarse-to-fine-grained searching (CFGS) algorithm to obtain high-quality solutions. Numerical results demonstrate that, with optimized antenna adjustments, the MA-enabled system achieves substantial performance and reliability improvements in RSMA over fixed-position antenna setups.
Unmanned aerial vehicles (UAVs) are gaining widespread use in wireless relay systems due to their exceptional flexibility and cost-effectiveness. This paper focuses on the integrated design of UAV trajectories and the precoders at both the transmitter and UAV in a UAV-assisted relay communication system, accounting for transmit power constraints and UAV flight limitations. Unlike previous works that primarily address multiple-input single-output (MISO) systems with Gaussian inputs, we investigate a more realistic scenario involving multiple-input multiple-output (MIMO) systems with finite-alphabet inputs. To tackle the challenging and inherently non-convex problem, we propose an efficient solution algorithm that leverages successive convex approximation and alternating optimization techniques. Simulation results validate the effectiveness of the proposed algorithm, demonstrating its capability to optimize system performance.
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications, often yielding faster progress per iteration on the training loss compared to first-order optimizers. However, the generalization properties of second-order methods are still being debated. Theoretical investigations have proved difficult to carry out outside the tractable settings of heavily simplified model classes -- thus, the relevance of existing theories to practical deep learning applications remains unclear. Similarly, empirical studies in large-scale models and real datasets are significantly confounded by the necessity to approximate second-order updates in practice. It is often unclear whether the observed generalization behaviour arises specifically from the second-order nature of the parameter updates, or instead reflects the specific structured (e.g.\ Kronecker) approximations used or any damping-based interpolation towards first-order updates. Here, we show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep reversible architectures that are sufficiently expressive to be meaningfully applied to common benchmark datasets. We exploit this novel setting to study the training and generalization properties of the GN optimizer. We find that exact GN generalizes poorly. In the mini-batch training setting, this manifests as rapidly saturating progress even on the \emph{training} loss, with parameter updates found to overfit each mini-batchatch without producing the features that would support generalization to other mini-batches. We show that our experiments run in the ``lazy'' regime, in which the neural tangent kernel (NTK) changes very little during the course of training. This behaviour is associated with having no significant changes in neural representations, explaining the lack of generalization.
Neural radiance fields (NeRFs) are a powerful tool for implicit scene representations, allowing for differentiable rendering and the ability to make predictions about unseen viewpoints. There has been growing interest in object and scene-based localisation using NeRFs, with a number of recent works relying on sampling-based or Monte-Carlo localisation schemes. Unfortunately, these can be extremely computationally expensive, requiring multiple network forward passes to infer camera or object pose. To alleviate this, a variety of sampling strategies have been applied, many relying on keypoint recognition techniques from classical computer vision. This work conducts a systematic empirical comparison of these approaches and shows that in contrast to conventional feature matching approaches for geometry-based localisation, sampling-based localisation using NeRFs benefits significantly from stable features. Results show that rendering stable features provides significantly better estimation with a tenfold reduction in the number of forward passes required.
In multi-modal learning, some modalities are more influential than others, and their absence can have a significant impact on classification/segmentation accuracy. Addressing this challenge, we propose a novel approach called Meta-learned Modality-weighted Knowledge Distillation (MetaKD), which enables multi-modal models to maintain high accuracy even when key modalities are missing. MetaKD adaptively estimates the importance weight of each modality through a meta-learning process. These learned importance weights guide a pairwise modality-weighted knowledge distillation process, allowing high-importance modalities to transfer knowledge to lower-importance ones, resulting in robust performance despite missing inputs. Unlike previous methods in the field, which are often task-specific and require significant modifications, our approach is designed to work in multiple tasks (e.g., segmentation and classification) with minimal adaptation. Experimental results on five prevalent datasets, including three Brain Tumor Segmentation datasets (BraTS2018, BraTS2019 and BraTS2020), the Alzheimer's Disease Neuroimaging Initiative (ADNI) classification dataset and the Audiovision-MNIST classification dataset, demonstrate the proposed model is able to outperform the compared models by a large margin.
Relation extraction as an important natural Language processing (NLP) task is to identify relations between named entities in text. Recently, graph convolutional networks over dependency trees have been widely used to capture syntactic features and achieved attractive performance. However, most existing dependency-based approaches ignore the positive influence of the words outside the dependency trees, sometimes conveying rich and useful information on relation extraction. In this paper, we propose a novel model, Entity-aware Self-attention Contextualized GCN (ESC-GCN), which efficiently incorporates syntactic structure of input sentences and semantic context of sequences. To be specific, relative position self-attention obtains the overall semantic pairwise correlation related to word position, and contextualized graph convolutional networks capture rich intra-sentence dependencies between words by adequately pruning operations. Furthermore, entity-aware attention layer dynamically selects which token is more decisive to make final relation prediction. In this way, our proposed model not only reduces the noisy impact from dependency trees, but also obtains easily-ignored entity-related semantic representation. Extensive experiments on various tasks demonstrate that our model achieves encouraging performance as compared to existing dependency-based and sequence-based models. Specially, our model excels in extracting relations between entities of long sentences.
The existence of representative datasets is a prerequisite of many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcome the limitations of purely data-driven approaches, and eventually to increase the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories integration, extraction and conformity. Special attention is given to applications in the field of autonomous driving.
Graph Convolutional Networks (GCNs) have been widely applied in various fields due to their significant power on processing graph-structured data. Typical GCN and its variants work under a homophily assumption (i.e., nodes with same class are prone to connect to each other), while ignoring the heterophily which exists in many real-world networks (i.e., nodes with different classes tend to form edges). Existing methods deal with heterophily by mainly aggregating higher-order neighborhoods or combing the immediate representations, which leads to noise and irrelevant information in the result. But these methods did not change the propagation mechanism which works under homophily assumption (that is a fundamental part of GCNs). This makes it difficult to distinguish the representation of nodes from different classes. To address this problem, in this paper we design a novel propagation mechanism, which can automatically change the propagation and aggregation process according to homophily or heterophily between node pairs. To adaptively learn the propagation process, we introduce two measurements of homophily degree between node pairs, which is learned based on topological and attribute information, respectively. Then we incorporate the learnable homophily degree into the graph convolution framework, which is trained in an end-to-end schema, enabling it to go beyond the assumption of homophily. More importantly, we theoretically prove that our model can constrain the similarity of representations between nodes according to their homophily degree. Experiments on seven real-world datasets demonstrate that this new approach outperforms the state-of-the-art methods under heterophily or low homophily, and gains competitive performance under homophily.
Substantial efforts have been devoted more recently to presenting various methods for object detection in optical remote sensing images. However, the current survey of datasets and deep learning based methods for object detection in optical remote sensing images is not adequate. Moreover, most of the existing datasets have some shortcomings, for example, the numbers of images and object categories are small scale, and the image diversity and variations are insufficient. These limitations greatly affect the development of deep learning based object detection methods. In the paper, we provide a comprehensive review of the recent deep learning based object detection progress in both the computer vision and earth observation communities. Then, we propose a large-scale, publicly available benchmark for object DetectIon in Optical Remote sensing images, which we name as DIOR. The dataset contains 23463 images and 192472 instances, covering 20 object classes. The proposed DIOR dataset 1) is large-scale on the object categories, on the object instance number, and on the total image number; 2) has a large range of object size variations, not only in terms of spatial resolutions, but also in the aspect of inter- and intra-class size variability across objects; 3) holds big variations as the images are obtained with different imaging conditions, weathers, seasons, and image quality; and 4) has high inter-class similarity and intra-class diversity. The proposed benchmark can help the researchers to develop and validate their data-driven methods. Finally, we evaluate several state-of-the-art approaches on our DIOR dataset to establish a baseline for future research.
Object detection typically assumes that training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch will lead to a significant performance drop. In this work, we aim to improve the cross-domain robustness of object detection. We tackle the domain shift on two levels: 1) the image-level shift, such as image style, illumination, etc, and 2) the instance-level shift, such as object appearance, size, etc. We build our approach based on the recent state-of-the-art Faster R-CNN model, and design two domain adaptation components, on image level and instance level, to reduce the domain discrepancy. The two domain adaptation components are based on H-divergence theory, and are implemented by learning a domain classifier in adversarial training manner. The domain classifiers on different levels are further reinforced with a consistency regularization to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We evaluate our newly proposed approach using multiple datasets including Cityscapes, KITTI, SIM10K, etc. The results demonstrate the effectiveness of our proposed approach for robust object detection in various domain shift scenarios.