In this paper, we investigate a cascaded channel estimation method for a millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) system aided by a reconfigurable intelligent surface (RIS), where the base station (BS) is equipped with low-resolution analog-to-digital converters (ADCs) and both the BS and the RIS employ a uniform planar array (UPA). Owing to the sparsity of the mmWave channel, channel estimation can be formulated as a compressed sensing (CS) problem. However, low-resolution quantization causes severe loss of signal information, and traditional CS algorithms fail to work well. To recover the signal and the sparse angular-domain channel from the quantized measurements, we introduce Bayesian inference and the efficient vector approximate message passing (VAMP) algorithm to solve the quantized-output CS problem. To further improve the efficiency of the VAMP algorithm, a fast computation method based on the fast Fourier transform (FFT) is derived. Simulation results demonstrate the effectiveness and accuracy of the proposed cascaded channel estimation method for the RIS-aided mmWave massive MIMO system with few-bit ADCs. Furthermore, at low signal-to-noise ratio (SNR), the proposed method keeps the performance gap between low-resolution and infinite-resolution ADCs acceptably small, which implies the applicability of few-bit ADCs in practice.
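A minimal sketch of the elementwise step that lets AMP-type algorithms handle few-bit ADC outputs: the posterior (MMSE) mean of the pre-quantization signal given its quantization bin, computed from truncated-Gaussian moments. The AWGN-plus-uniform-quantizer model and the function name below are illustrative assumptions, not the paper's exact VAMP implementation.

```python
import numpy as np
from scipy.stats import norm

def mmse_dequantize(lo, hi, m, v, noise_var):
    """Posterior mean of z ~ N(m, v) given that the noisy measurement
    z + w, w ~ N(0, noise_var), fell into the quantizer bin [lo, hi].
    This truncated-Gaussian moment is the elementwise 'denoiser' used
    by AMP-type algorithms with quantized outputs (illustrative)."""
    s = np.sqrt(v + noise_var)                      # std of z + w
    a, b = (lo - m) / s, (hi - m) / s               # normalized bin edges
    ratio = (norm.pdf(a) - norm.pdf(b)) / np.maximum(norm.cdf(b) - norm.cdf(a), 1e-12)
    return m + (v / s) * ratio                      # E[z | z + w in [lo, hi]]
```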
In this paper, we address the task of aberration-aware depth-from-defocus (DfD), which takes into account the spatially variant point spread functions (PSFs) of a real camera. To effectively obtain the spatially variant PSFs of a real camera without requiring any ground-truth PSFs, we propose a novel self-supervised learning method that leverages a pair of real sharp and blurred images, which can be easily captured by changing the aperture setting of the camera. In our PSF estimation, we assume rotationally symmetric PSFs and introduce a polar coordinate representation to learn the PSF estimation network more accurately. We also handle the focus breathing phenomenon that occurs in real DfD situations. Experimental results on synthetic and real data demonstrate the effectiveness of our method regarding both PSF estimation and depth estimation.
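As an illustration of the rotational-symmetry assumption, the sketch below renders a PSF on a Cartesian grid from a 1-D radial profile; the function name and sampling scheme are hypothetical, not the paper's network.

```python
import numpy as np

def radial_profile_to_psf(profile, size):
    """Render a rotationally symmetric PSF on a size x size grid from a
    1-D radial profile (values at integer radii, in pixels); sketch only."""
    c = (size - 1) / 2.0                            # grid center
    y, x = np.mgrid[0:size, 0:size]
    r = np.hypot(x - c, y - c)                      # radius of each pixel
    psf = np.interp(r, np.arange(len(profile)), profile, right=0.0)
    return psf / psf.sum()                          # normalize to unit energy
```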
In this paper, we introduce a new flow-based method for global optimization of Lipschitz functions, called Stein Boltzmann Sampling (SBS). Our method samples from the Boltzmann distribution, which becomes asymptotically uniform over the set of minimizers of the function to be optimized. Candidate solutions are sampled via the \emph{Stein Variational Gradient Descent} (SVGD) algorithm. We prove the asymptotic convergence of our method, introduce two SBS variants, and provide a detailed comparison with several state-of-the-art global optimization algorithms on various benchmark functions. The design of our method, the theoretical results, and our experiments suggest that SBS is particularly well suited as a continuation of efficient global optimization methods, as it can produce better solutions while making good use of the budget.
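For reference, one SVGD update on a particle set looks as follows; the RBF kernel with the median heuristic is a common default, and the step size, bandwidth, and temperature handling are assumptions rather than the SBS specifics.

```python
import numpy as np

def svgd_step(x, grad_logp, step=1e-2):
    """One Stein Variational Gradient Descent update on particles x (n, d).
    grad_logp: (n, d) scores at the particles; for a Boltzmann target
    p(x) proportional to exp(-f(x)/T), pass -grad_f(x)/T."""
    n = x.shape[0]
    diff = x[None, :, :] - x[:, None, :]            # diff[j, i] = x_i - x_j
    d2 = np.sum(diff ** 2, axis=-1)                 # pairwise squared distances
    h = np.median(d2) / np.log(n + 1) + 1e-12       # median-heuristic bandwidth
    k = np.exp(-d2 / h)                             # RBF kernel matrix
    grad_k = (2.0 / h) * np.einsum('ji,jid->id', k, diff)   # repulsion term
    phi = (k @ grad_logp + grad_k) / n              # Stein variational direction
    return x + step * phi
```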
In this work, we address the problem of transferring an autonomous driving (AD) module from one domain to another, in particular from simulation to the real world (Sim2Real). We propose a data-efficient method for online, on-the-fly adaptation of parametrizable control architectures such that the target closed-loop performance is optimized under several sources of uncertainty, such as model mismatches, environment changes, and task choice. The novelty of the work resides in leveraging black-box optimization enabled by executable digital twins, with data-driven hyper-parameter tuning through derivative-free methods to adapt the AD module directly in real time. Our method requires a minimal amount of interaction with the real world during the randomization and online training phases. Specifically, we validate our approach in real-world experiments and show the ability to transfer and safely tune a nonlinear model predictive controller (NMPC) in less than 10 minutes, eliminating the need for day-long manual tuning and hours-long machine learning training phases. Our results show that the online-adapted NMPC directly compensates for disturbances, avoids overtuning to simulation or to one specific task, generalizes across a multitude of trajectories with less than 15 cm tracking error, and achieves an 83% improvement in tracking.
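A bare-bones sketch of the derivative-free tuning loop the abstract describes, with plain random search standing in for the paper's black-box optimizer; `evaluate_cost` (a closed-loop rollout on the executable digital twin or the real vehicle) and all parameters are assumptions.

```python
import numpy as np

def tune_controller(evaluate_cost, theta0, sigma=0.1, iters=50, seed=0):
    """Derivative-free tuning of controller parameters theta against a
    closed-loop cost (e.g., tracking error over one rollout). Random
    search is a stand-in for the paper's black-box optimizer."""
    rng = np.random.default_rng(seed)
    best = np.asarray(theta0, dtype=float)
    best_cost = evaluate_cost(best)
    for _ in range(iters):
        cand = best + sigma * rng.standard_normal(best.shape)
        cost = evaluate_cost(cand)                  # rollout on twin/vehicle
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost
```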
In Visual SLAM, accurate feature matching consumes a significant amount of time, severely impacting the real-time performance of the system. This paper proposes an accelerated method for Visual SLAM that integrates GMS (Grid-based Motion Statistics) with RANSAC (Random Sample Consensus) to remove mismatched features. The approach first applies the GMS algorithm to estimate the number of matched pairs within each neighborhood and ranks the matches by confidence. Subsequently, RANSAC is employed to further eliminate mismatched features. To address the time-consuming issue of sampling randomly over all matched pairs, the method reformulates it as prioritized sample selection from high-confidence matches, which enables iterative estimation of the optimal model. Experimental results demonstrate that the proposed method achieves accuracy comparable to the original GMS-RANSAC while reducing the average runtime by 24.13% on the KITTI, TUM desk, and TUM doll datasets.
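The prioritization idea can be sketched as follows: run RANSAC on the highest-confidence GMS matches first and fall back to the full set only if that fails. The confidence scores (e.g., GMS neighborhood match counts), the top fraction, and the homography model are assumptions for illustration.

```python
import numpy as np
import cv2

def prioritized_ransac(pts1, pts2, scores, top_frac=0.3):
    """Fit a model with RANSAC using the highest-confidence matches
    first (PROSAC-style prioritization); a sketch, not the paper's code.
    pts1, pts2: (N, 2) float32 arrays; scores: per-match confidences."""
    order = np.argsort(-np.asarray(scores))         # best matches first
    k = max(8, int(top_frac * len(order)))
    idx = order[:k]
    H, mask = cv2.findHomography(pts1[idx], pts2[idx], cv2.RANSAC, 3.0)
    if H is None:                                   # fall back to all matches
        H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
    return H, mask
```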
In this paper, we establish an information-theoretic equivalence for entropic multi-marginal optimal transport (MOT). This equivalence readily reduces to the case of entropic optimal transport (OT). Because OT is widely used to compare differences between knowledge or beliefs, we apply this result to communication between agents with different beliefs. Our results formally prove the statement of Wang et al. [2020] that entropic OT is information-theoretically optimal, and generalize it to the multi-agent case. We believe that our work can shed light on OT theory for future multi-agent teaming systems.
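For context, the standard entropic OT objective and its multi-marginal extension are shown below; this is the textbook formulation, and the notation is not necessarily the paper's.

```latex
% Entropic OT between marginals \mu, \nu with cost c and \varepsilon > 0:
\mathrm{OT}_{\varepsilon}(\mu,\nu)
  = \min_{\pi \in \Pi(\mu,\nu)} \int c \,\mathrm{d}\pi
    + \varepsilon\, \mathrm{KL}\!\left(\pi \,\|\, \mu \otimes \nu\right),
% and its multi-marginal extension over couplings of \mu_1,\dots,\mu_K:
\mathrm{MOT}_{\varepsilon}(\mu_1,\dots,\mu_K)
  = \min_{\pi \in \Pi(\mu_1,\dots,\mu_K)} \int c \,\mathrm{d}\pi
    + \varepsilon\, \mathrm{KL}\Bigl(\pi \,\Big\|\, \textstyle\bigotimes_{k=1}^{K} \mu_k\Bigr).
```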
In this paper we derive a non-local (``integral'') equation which transforms a three-dimensional acoustic transmission problem with \emph{variable} coefficients, non-zero absorption, and mixed boundary conditions into a non-local equation on a ``skeleton'' of the domain $\Omega\subset\mathbb{R}^{3}$, where ``skeleton'' stands for the union of the interfaces and boundaries of a Lipschitz partition of $\Omega$. To that end, we introduce and analyze abstract layer potentials as solutions of auxiliary coercive full-space variational problems and derive jump conditions across domain interfaces. This allows us to formulate the non-local skeleton equation as a \emph{direct method} for the unknown Cauchy data of the solution of the original partial differential equation. We establish coercivity and continuity of the variational form of the skeleton equation based on auxiliary full-space variational problems. Explicit expressions for Green's functions are not required, and all our estimates are \emph{explicit} in the complex wave number.
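As a point of reference, a representative transmission problem of the kind the abstract describes is sketched below; the precise coefficients, boundary conditions, and sign conventions in the paper may differ.

```latex
% Representative model (illustrative only): piecewise smooth A, n on a
% Lipschitz partition of \Omega, complex wave number s with
% \operatorname{Re} s > 0 encoding non-zero absorption:
-\operatorname{div}\!\left(A\,\nabla u\right) + s^{2}\, n\, u = f
    \quad \text{in } \Omega,
\qquad
[\,u\,] = 0, \quad [\,A\,\nabla u \cdot \mathbf{n}\,] = 0
    \quad \text{across the interior interfaces (the ``skeleton''),}
% with mixed (e.g., Dirichlet/Neumann) conditions on \partial\Omega.
```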
In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) missing modalities that occur either during training or testing in real-world situations; and 2) computation resources that are insufficient to finetune heavy transformer models. To this end, we propose to utilize prompt learning to mitigate these two challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while requiring less than 1% of the learnable parameters compared to training the entire model. We further explore the effect of different prompt configurations and analyze the robustness to missing modalities. Extensive experiments show the effectiveness of our prompt learning framework, which improves performance under various missing-modality cases while alleviating the requirement of heavy model retraining. Code is available.
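A minimal sketch of how such prompts could be plugged in: a small bank of learnable tokens, indexed by the missing-modality case and prepended to the frozen transformer's input sequence. The class, the three-case indexing, and the shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MissingAwarePrompts(nn.Module):
    """Learnable prompt tokens selected by the missing-modality case and
    prepended to a (frozen) multimodal transformer's input; sketch only."""
    def __init__(self, dim, n_tokens=16, n_cases=3):
        super().__init__()
        # one prompt bank per case, e.g. 0: complete, 1: text missing,
        # 2: image missing (hypothetical indexing)
        self.prompts = nn.Parameter(0.02 * torch.randn(n_cases, n_tokens, dim))

    def forward(self, tokens, case_idx):
        # tokens: (B, L, dim); case_idx: (B,) long tensor of case indices
        p = self.prompts[case_idx]                  # (B, n_tokens, dim)
        return torch.cat([p, tokens], dim=1)        # prompts go first
```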
Image-level weakly supervised semantic segmentation (WSSS) is a fundamental yet challenging computer vision task that facilitates scene understanding and autonomous driving. Most existing methods resort to classification-based Class Activation Maps (CAMs) to serve as the initial pseudo labels, which tend to focus on discriminative image regions and lack characteristics customized for the segmentation task. To alleviate this issue, we propose a novel activation modulation and recalibration (AMR) scheme, which leverages a spotlight branch and a compensation branch to obtain weighted CAMs that can provide recalibration supervision and task-specific concepts. Specifically, an attention modulation module (AMM) is employed to rearrange the distribution of feature importance from a channel-spatial sequential perspective, which helps to explicitly model channel-wise interdependencies and spatial encodings so as to adaptively modulate segmentation-oriented activation responses. Furthermore, we introduce cross pseudo supervision for the dual branches, which can be regarded as a semantic-similarity regularization that mutually refines the two branches. Extensive experiments show that AMR establishes a new state-of-the-art performance on the PASCAL VOC 2012 dataset, surpassing not only current methods trained with image-level supervision but also some methods relying on stronger supervision, such as saliency labels. Experiments also reveal that our scheme is plug-and-play and can be incorporated into other approaches to boost their performance.
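To make the channel-spatial sequential idea concrete, here is a generic channel-then-spatial attention block in the spirit of AMM; the reduction ratio, kernel size, and pooling choices are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelSpatialModulation(nn.Module):
    """Generic channel-then-spatial attention modulation (a sketch in the
    spirit of AMM; the published module may differ in detail)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):                           # x: (B, C, H, W)
        x = x * self.channel(x)                     # channel-wise reweighting
        s = torch.cat([x.mean(1, keepdim=True),     # spatial descriptor from
                       x.amax(1, keepdim=True)], 1) # avg- and max-pooled maps
        return x * self.spatial(s)                  # spatial reweighting
```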
In this paper, we propose a novel Feature Decomposition and Reconstruction Learning (FDRL) method for effective facial expression recognition. We view expression information as the combination of shared information (expression similarities) across different expressions and unique information (expression-specific variations) for each expression. More specifically, FDRL mainly consists of two crucial networks: a Feature Decomposition Network (FDN) and a Feature Reconstruction Network (FRN). In particular, FDN first decomposes the basic features extracted from a backbone network into a set of facial action-aware latent features to model expression similarities. Then, FRN captures the intra-feature and inter-feature relationships among the latent features to characterize expression-specific variations, and reconstructs the expression feature. To this end, two modules, an intra-feature relation modeling module and an inter-feature relation modeling module, are developed in FRN. Experimental results on both in-the-lab databases (including CK+, MMI, and Oulu-CASIA) and in-the-wild databases (including RAF-DB and SFEW) show that the proposed FDRL method consistently achieves higher recognition accuracy than several state-of-the-art methods. This clearly highlights the benefit of feature decomposition and reconstruction for classifying expressions.
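The decomposition step can be pictured as K parallel projections of the backbone feature into shared latent features; the sketch below is illustrative (the number of latents, activation, and layer types are assumptions, not FDN's exact architecture).

```python
import torch
import torch.nn as nn

class FeatureDecomposition(nn.Module):
    """Decompose a backbone feature into K facial action-aware latent
    features via K parallel projections (illustrative sketch of the
    decomposition idea, not the paper's FDN)."""
    def __init__(self, in_dim, latent_dim, n_latents=9):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Linear(in_dim, latent_dim) for _ in range(n_latents))

    def forward(self, feat):                        # feat: (B, in_dim)
        latents = [torch.relu(p(feat)) for p in self.proj]
        return torch.stack(latents, dim=1)          # (B, K, latent_dim)
```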
In this paper, we propose joint learning of attention and recurrent neural network (RNN) models for multi-label classification. While approaches based on either model exist (e.g., for the task of image captioning), training such network architectures typically requires pre-defined label sequences. For multi-label classification, it is desirable to have a robust inference process so that prediction errors do not propagate and degrade performance. Our proposed model uniquely integrates attention and Long Short-Term Memory (LSTM) models, which not only addresses the above problem but also allows one to identify visual objects of interest with varying sizes without prior knowledge of a particular label ordering. More importantly, label co-occurrence information can be jointly exploited by our LSTM model. Finally, by advancing the technique of beam search, our proposed network model efficiently predicts multiple labels.
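A compact sketch of beam search over label sequences with a stop symbol; `step_logprobs` (the LSTM's per-step label distribution) and the no-repeat constraint are assumptions used for illustration.

```python
import numpy as np

def label_beam_search(step_logprobs, n_labels, beam=3, max_len=5):
    """Beam search over label sequences. step_logprobs(seq) returns
    log-probabilities over all labels plus a stop symbol (index
    n_labels), conditioned on the labels predicted so far; sketch."""
    beams, done = [([], 0.0)], []
    for _ in range(max_len):
        cand = []
        for seq, score in beams:
            lp = step_logprobs(seq)
            for lbl in range(n_labels + 1):
                if lbl in seq:
                    continue                        # predict each label once
                if lbl == n_labels:                 # stop symbol ends the beam
                    done.append((seq, score + lp[lbl]))
                else:
                    cand.append((seq + [lbl], score + lp[lbl]))
        beams = sorted(cand, key=lambda t: -t[1])[:beam]
    return max(done + beams, key=lambda t: t[1])    # best label set found
```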