The core of cross-modal matching is to accurately measure the similarity between different modalities in a unified representation space. However, compared to textual descriptions of a certain perspective, the visual modality has more semantic variations. So, images are usually associated with multiple textual captions in databases. Although popular symmetric embedding methods have explored numerous modal interaction approaches, they often learn toward increasing the average expression probability of multiple semantic variations within image embeddings. Consequently, information entropy in embeddings is increased, resulting in redundancy and decreased accuracy. In this work, we propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy. Specifically, we obtain a set of heterogeneous visual sub-embeddings through dynamic orthogonal constraint loss. To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution and employ a variance-aware weighting loss to assign different weights to the optimization process. In addition, we develop a Fast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and enhance the performance. We compare the performance with existing set-based method using four image feature encoders and two text feature encoders on three benchmark datasets: MSCOCO, Flickr30K and CUB Captions. We also show the role of different components by ablation studies and perform a sensitivity analysis of the hyperparameters. The qualitative analysis of visualized bidirectional retrieval and attention maps further demonstrates the ability of our method to encode semantic variations.
Recent molecular communication (MC) research has integrated more detailed computational models to capture the dynamics of practical biophysical systems. This research focuses on developing realistic models for MC transceivers inspired by spheroids - three-dimensional cell aggregates commonly used in organ-on-chip experimental systems. Potential applications that can be used or modeled with spheroids include nutrient transport in an organ-on-chip system, the release of biomarkers or reception of drug molecules by a cancerous tumor site, or transceiver nanomachines participating in information exchange. In this paper, a simple diffusive MC system is considered where a spheroidal transmitter and receiver are in an unbounded fluid environment. These spheroidal antennas are modeled as porous media for diffusive signaling molecules, then their boundary conditions and effective diffusion coefficients are characterized. Further, for either a point source or spheroidal transmitter, Green's function for concentration (GFC) outside and inside the receiving spheroid is analytically derived and formulated in terms of an infinite series and confirmed by a particle-based simulator (PBS). The provided GFCs enable computation of the transmitted and received signals in the spheroidal communication system. This study shows that the porous structure of the receiving spheroid amplifies diffusion signals but also disperses them, thus there is a trade-off between porosity and information transmission rate. Also, the results reveal that the porous arrangement of the transmitting spheroid not only disperses the received signal but also attenuates it. System performance is also evaluated in terms of bit error rate (BER). Decreasing the porosity of the receiving spheroid is shown to enhance system performance. Conversely, reducing the porosity of the transmitting spheroid can adversely affect system performance.
A decision tree is one of the most popular approaches in machine learning fields. However, it suffers from the problem of overfitting caused by overly deepened trees. Then, a meta-tree is recently proposed. It solves the problem of overfitting caused by overly deepened trees. Moreover, the meta-tree guarantees statistical optimality based on Bayes decision theory. Therefore, the meta-tree is expected to perform better than the decision tree. In contrast to a single decision tree, it is known that ensembles of decision trees, which are typically constructed boosting algorithms, are more effective in improving predictive performance. Thus, it is expected that ensembles of meta-trees are more effective in improving predictive performance than a single meta-tree, and there are no previous studies that construct multiple meta-trees in boosting. Therefore, in this study, we propose a method to construct multiple meta-trees using a boosting approach. Through experiments with synthetic and benchmark datasets, we conduct a performance comparison between the proposed methods and the conventional methods using ensembles of decision trees. Furthermore, while ensembles of decision trees can cause overfitting as well as a single decision tree, experiments confirmed that ensembles of meta-trees can prevent overfitting due to the tree depth.
Modeling the correlations among errors is closely associated with how accurately the model can quantify predictive uncertainty in probabilistic time series forecasting. Recent multivariate models have made significant progress in accounting for contemporaneous correlations among errors, while a common assumption on these errors is that they are temporally independent for the sake of statistical simplicity. However, real-world observations often deviate from this assumption, since errors usually exhibit substantial autocorrelation due to various factors such as the exclusion of temporally correlated covariates. In this work, we propose an efficient method, based on a low-rank-plus-diagonal parameterization of the covariance matrix, which can effectively characterize the autocorrelation of errors. The proposed method possesses several desirable properties: the complexity does not scale with the number of time series, the resulting covariance can be used for calibrating predictions, and it can seamlessly integrate with any model with Gaussian-distributed errors. We empirically demonstrate these properties using two distinct neural forecasting models-GPVar and Transformer. Our experimental results confirm the effectiveness of our method in enhancing predictive accuracy and the quality of uncertainty quantification on multiple real-world datasets.
Transformers have recently emerged as a significant force in the field of image deraining. Existing image deraining methods utilize extensive research on self-attention. Though showcasing impressive results, they tend to neglect critical frequency information, as self-attention is generally less adept at capturing high-frequency details. To overcome this shortcoming, we have developed an innovative Dual-Path Coupled Deraining Network (DPCNet) that integrates information from both spatial and frequency domains through Spatial Feature Extraction Block (SFEBlock) and Frequency Feature Extraction Block (FFEBlock). We have further introduced an effective Adaptive Fusion Module (AFM) for the dual-path feature aggregation. Extensive experiments on six public deraining benchmarks and downstream vision tasks have demonstrated that our proposed method not only outperforms the existing state-of-the-art deraining method but also achieves visually pleasuring results with excellent robustness on downstream vision tasks.
Linear regression adjustment is commonly used to analyse randomised controlled experiments due to its efficiency and robustness against model misspecification. Current testing and interval estimation procedures leverage the asymptotic distribution of such estimators to provide Type-I error and coverage guarantees that hold only at a single sample size. Here, we develop the theory for the anytime-valid analogues of such procedures, enabling linear regression adjustment in the sequential analysis of randomised experiments. We first provide sequential $F$-tests and confidence sequences for the parametric linear model, which provide time-uniform Type-I error and coverage guarantees that hold for all sample sizes. We then relax all linear model parametric assumptions in randomised designs and provide nonparametric model-free sequential tests and confidence sequences for treatment effects. This formally allows experiments to be continuously monitored for significance, stopped early, and safeguards against statistical malpractices in data collection. A particular feature of our results is their simplicity. Our test statistics and confidence sequences all emit closed-form expressions, which are functions of statistics directly available from a standard linear regression table. We illustrate our methodology with the sequential analysis of software A/B experiments at Netflix, performing regression adjustment with pre-treatment outcomes.
Movable antenna (MA) provides an innovative way to arrange antennas that can contribute to improved signal quality and more effective interference management. This method is especially beneficial for co-frequency co-time full-duplex (CCFD) wireless communication, which struggles with self-interference (SI) that usually overpowers the desired incoming signals. By dynamically repositioning transmit/receive antennas, we can mitigate the SI and enhance the reception of incoming signals. Thus, this paper proposes a novel MA-enabled point-to-point CCFD system and formulates the minimum achievable rate of two CCFD terminals. To maximize the minimum achievable rate and determine the near-optimal positions of the MAs, we introduce a solution based on projected particle swarm optimization (PPSO), which can circumvent common suboptimal positioning issues. Moreover, numerical results reveal that the PPSO method leads to a better performance compared to the conventional alternating position optimization (APO). The results also demonstrate that an MA-enabled CCFD system outperforms the one using fixed-position antennas (FPAs).
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
Conventional methods for object detection typically require a substantial amount of training data and preparing such high-quality training data is very labor-intensive. In this paper, we propose a novel few-shot object detection network that aims at detecting objects of unseen categories with only a few annotated examples. Central to our method are our Attention-RPN, Multi-Relation Detector and Contrastive Training strategy, which exploit the similarity between the few shot support set and query set to detect novel objects while suppressing false detection in the background. To train our network, we contribute a new dataset that contains 1000 categories of various objects with high-quality annotations. To the best of our knowledge, this is one of the first datasets specifically designed for few-shot object detection. Once our few-shot network is trained, it can detect objects of unseen categories without further training or fine-tuning. Our method is general and has a wide range of potential applications. We produce a new state-of-the-art performance on different datasets in the few-shot setting. The dataset link is //github.com/fanq15/Few-Shot-Object-Detection-Dataset.
Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.
Multi-relation Question Answering is a challenging task, due to the requirement of elaborated analysis on questions and reasoning over multiple fact triples in knowledge base. In this paper, we present a novel model called Interpretable Reasoning Network that employs an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decides which part of an input question should be analyzed at each hop; predicts a relation that corresponds to the current parsed results; utilizes the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model can offer traceable and observable intermediate predictions for reasoning analysis and failure diagnosis.