We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs). Our approach outperforms predominant filtering methods (e.g., CLIPScore) via integrating the recent advances in MLMs. We design four distinct yet complementary metrics to holistically measure the quality of image-text data. A new pipeline is established to construct high-quality instruction data for fine-tuning MLMs as data filters. Comparing with CLIPScore, our MLM filters produce more precise and comprehensive scores that directly improve the quality of filtered data and boost the performance of pre-trained models. We achieve significant improvements over CLIPScore on popular foundation models (i.e., CLIP and BLIP2) and various downstream tasks. Our MLM filter can generalize to different models and tasks, and be used as a drop-in replacement for CLIPScore. An additional ablation study is provided to verify our design choices for the MLM filter.
Automatic 3D facial texture generation has gained significant interest recently. Existing approaches may not support the traditional physically based rendering pipeline or rely on 3D data captured by Light Stage. Our key contribution is a progressive latent space refinement approach that can bootstrap from 3D Morphable Models (3DMMs)-based texture maps generated from facial images to generate high-quality and diverse PBR textures, including albedo, normal, and roughness. It starts with enhancing Generative Adversarial Networks (GANs) for text-guided and diverse texture generation. To this end, we design a self-supervised paradigm to overcome the reliance on ground truth 3D textures and train the generative model with only entangled texture maps. Besides, we foster mutual enhancement between GANs and Score Distillation Sampling (SDS). SDS boosts GANs with more generative modes, while GANs promote more efficient optimization of SDS. Furthermore, we introduce an edge-aware SDS for multi-view consistent facial structure. Experiments demonstrate that our method outperforms existing 3D texture generation methods regarding photo-realistic quality, diversity, and efficiency.
Cross-Domain Sequential Recommendation (CDSR) methods aim to tackle the data sparsity and cold-start problems present in Single-Domain Sequential Recommendation (SDSR). Existing CDSR works design their elaborate structures relying on overlapping users to propagate the cross-domain information. However, current CDSR methods make closed-world assumptions, assuming fully overlapping users across multiple domains and that the data distribution remains unchanged from the training environment to the test environment. As a result, these methods typically result in lower performance on online real-world platforms due to the data distribution shifts. To address these challenges under open-world assumptions, we design an \textbf{A}daptive \textbf{M}ulti-\textbf{I}nterest \textbf{D}ebiasing framework for cross-domain sequential recommendation (\textbf{AMID}), which consists of a multi-interest information module (\textbf{MIM}) and a doubly robust estimator (\textbf{DRE}). Our framework is adaptive for open-world environments and can improve the model of most off-the-shelf single-domain sequential backbone models for CDSR. Our MIM establishes interest groups that consider both overlapping and non-overlapping users, allowing us to effectively explore user intent and explicit interest. To alleviate biases across multiple domains, we developed the DRE for the CDSR methods. We also provide a theoretical analysis that demonstrates the superiority of our proposed estimator in terms of bias and tail bound, compared to the IPS estimator used in previous work.
In the era of large language models, applying techniques such as Retrieval Augmented Generation can better address Open-Domain Question-Answering problems. Due to constraints including model sizes and computing resources, the length of context is often limited, and it becomes challenging to empower the model to cover overlong contexts while answering questions from open domains. This paper proposes a general and convenient method to covering longer contexts in Open-Domain Question-Answering tasks. It leverages a small encoder language model that effectively encodes contexts, and the encoding applies cross-attention with origin inputs. With our method, the origin language models can cover several times longer contexts while keeping the computing requirements close to the baseline. Our experiments demonstrate that after fine-tuning, there is improved performance across two held-in datasets, four held-out datasets, and also in two In Context Learning settings.
We introduce a novel doubly-robust (DR) off-policy evaluation (OPE) estimator for Markov decision processes, DRUnknown, designed for situations where both the logging policy and the value function are unknown. The proposed estimator initially estimates the logging policy and then estimates the value function model by minimizing the asymptotic variance of the estimator while considering the estimating effect of the logging policy. When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within the class containing existing OPE estimators. When the value function model is also correctly specified, DRUnknown is optimal as its asymptotic variance reaches the semiparametric lower bound. We present experimental results conducted in contextual bandits and reinforcement learning to compare the performance of DRUnknown with that of existing methods.
We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches that are heavily reliant on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references. Using an LLM-based Visual Question Answering strategy, we gather detailed information about the environment which is used by the LLM for instruction synthesis. We implement our approach on multiple simulation platforms including Matterport3D, AI Habitat and ThreeDWorld, thereby demonstrating its platform-agnostic nature. We subjectively evaluate our approach via a user study and observe that 83.3% of users find the synthesized instructions accurately capture the details of the environment and show characteristics similar to those of human-generated instructions. Further, we conduct zero-shot navigation with multiple approaches on the REVERIE dataset using the generated instructions, and observe very close correlation with the baseline on standard success metrics (< 1% change in SR), quantifying the viability of generated instructions in replacing human-annotated data. We finally discuss the applicability of our approach in enabling a generalizable evaluation of embodied navigation policies. To the best of our knowledge, ours is the first LLM-driven approach capable of generating "human-like" instructions in a platform-agnostic manner, without training.
This paper develops a numerical procedure to accelerate the convergence of the Favre-averaged Non-Linear Harmonic (FNLH) method. The scheme provides a unified mathematical framework for solving the sparse linear systems formed by the mean flow and the time-linearized harmonic flows of FNLH in an explicit or implicit fashion. The approach explores the similarity of the sparse linear systems of FNLH and leads to a memory efficient procedure, so that its memory consumption does not depend on the number of harmonics to compute. The proposed method has been implemented in the industrial CFD solver HYDRA. Two test cases are used to conduct a comparative study of explicit and implicit schemes in terms of convergence, computational efficiency, and memory consumption. Comparisons show that the implicit scheme yields better convergence than the explicit scheme and is also roughly 7 to 10 times more computationally efficient than the explicit scheme with 4 levels of multigrid. Furthermore, the implicit scheme consumes only approximately $50\%$ of the explicit scheme with four levels of multigrid. Compared with the full annulus unsteady Reynolds averaged Navier-Stokes (URANS) simulations, the implicit scheme produces comparable results to URANS with computational time and memory consumption that are two orders of magnitude smaller.
Database de-anonymization typically involves matching an anonymized database with correlated publicly available data. Existing research focuses either on practical aspects without requiring knowledge of the data distribution yet provides limited guarantees, or on theoretical aspects assuming known distributions. This paper aims to bridge these two approaches, offering theoretical guarantees for database de-anonymization under synchronization errors and obfuscation without prior knowledge of data distribution. Using a modified replica detection algorithm and a new seeded deletion detection algorithm, we establish sufficient conditions on the database growth rate for successful matching, demonstrating a double-logarithmic seed size relative to row size is sufficient for detecting deletions in the database. Importantly, our findings indicate that these sufficient de-anonymization conditions are tight and are the same as in the distribution-aware setting, avoiding asymptotic performance loss due to unknown distributions. Finally, we evaluate the performance of our proposed algorithms through simulations, confirming their effectiveness in more practical, non-asymptotic, scenarios.
This paper presents a scalable multi-robot motion planning algorithm called Conflict-Based Model Predictive Control (CB-MPC). Inspired by Conflict-Based Search (CBS), the planner leverages a similar high-level conflict tree to efficiently resolve robot-robot conflicts in the continuous space, while reasoning about each agent's kinematic and dynamic constraints and actuation limits using MPC as the low-level planner. We show that tracking high-level multi-robot plans with a vanilla MPC controller is insufficient, and results in unexpected collisions in tight navigation scenarios. Compared to other variations of multi-robot MPC like joint, prioritized, and distributed, we demonstrate that CB-MPC improves the executability and success rate, allows for closer robot-robot interactions, and reduces the computational cost significantly without compromising the solution quality across a variety of environments. Furthermore, we show that CB-MPC combined with a high-level path planner can effectively substitute computationally expensive full-horizon multi-robot kinodynamic planners.
Few-shot image classification aims to classify unseen classes with limited labeled samples. Recent works benefit from the meta-learning process with episodic tasks and can fast adapt to class from training to testing. Due to the limited number of samples for each task, the initial embedding network for meta learning becomes an essential component and can largely affects the performance in practice. To this end, many pre-trained methods have been proposed, and most of them are trained in supervised way with limited transfer ability for unseen classes. In this paper, we proposed to train a more generalized embedding network with self-supervised learning (SSL) which can provide slow and robust representation for downstream tasks by learning from the data itself. We evaluate our work by extensive comparisons with previous baseline methods on two few-shot classification datasets ({\em i.e.,} MiniImageNet and CUB). Based on the evaluation results, the proposed method achieves significantly better performance, i.e., improve 1-shot and 5-shot tasks by nearly \textbf{3\%} and \textbf{4\%} on MiniImageNet, by nearly \textbf{9\%} and \textbf{3\%} on CUB. Moreover, the proposed method can gain the improvement of (\textbf{15\%}, \textbf{13\%}) on MiniImageNet and (\textbf{15\%}, \textbf{8\%}) on CUB by pretraining using more unlabeled data. Our code will be available at \hyperref[//github.com/phecy/SSL-FEW-SHOT.]{//github.com/phecy/ssl-few-shot.}
Medical image segmentation requires consensus ground truth segmentations to be derived from multiple expert annotations. A novel approach is proposed that obtains consensus segmentations from experts using graph cuts (GC) and semi supervised learning (SSL). Popular approaches use iterative Expectation Maximization (EM) to estimate the final annotation and quantify annotator's performance. Such techniques pose the risk of getting trapped in local minima. We propose a self consistency (SC) score to quantify annotator consistency using low level image features. SSL is used to predict missing annotations by considering global features and local image consistency. The SC score also serves as the penalty cost in a second order Markov random field (MRF) cost function optimized using graph cuts to derive the final consensus label. Graph cut obtains a global maximum without an iterative procedure. Experimental results on synthetic images, real data of Crohn's disease patients and retinal images show our final segmentation to be accurate and more consistent than competing methods.