Backdoor attacks are commonly executed by contaminating training data, such that a trigger can activate predetermined harmful effects during the test phase. In this work, we present AnyDoor, a test-time backdoor attack against multimodal large language models (MLLMs), which involves injecting the backdoor into the textual modality using adversarial test images (sharing the same universal perturbation), without requiring access to or modification of the training data. AnyDoor employs similar techniques used in universal adversarial attacks, but distinguishes itself by its ability to decouple the timing of setup and activation of harmful effects. In our experiments, we validate the effectiveness of AnyDoor against popular MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2, as well as provide comprehensive ablation studies. Notably, because the backdoor is injected by a universal perturbation, AnyDoor can dynamically change its backdoor trigger prompts/harmful effects, exposing a new challenge for defending against backdoor attacks. Our project page is available at //sail-sg.github.io/AnyDoor/.
Turning pass-through network architectures into iterative ones, which use their own output as input, is a well-known approach for boosting performance. In this paper, we argue that such architectures offer an additional benefit: The convergence rate of their successive outputs is highly correlated with the accuracy of the value to which they converge. Thus, we can use the convergence rate as a useful proxy for uncertainty. This results in an approach to uncertainty estimation that provides state-of-the-art estimates at a much lower computational cost than techniques like Ensembles, and without requiring any modifications to the original iterative model. We demonstrate its practical value by embedding it in two application domains: road detection in aerial images and the estimation of aerodynamic properties of 2D and 3D shapes.
Exoskeleton locomotion must be robust while being adaptive to different users with and without payloads. To address these challenges, this work introduces a data-driven predictive control (DDPC) framework to synthesize walking gaits for lower-body exoskeletons, employing Hankel matrices and a state transition matrix for its data-driven model. The proposed approach leverages DDPC through a multi-layer architecture. At the top layer, DDPC serves as a planner employing Hankel matrices and a state transition matrix to generate a data-driven model that can learn and adapt to varying users and payloads. At the lower layer, our method incorporates inverse kinematics and passivity-based control to map the planned trajectory from DDPC into the full-order states of the lower-body exoskeleton. We validate the effectiveness of this approach through numerical simulations and hardware experiments conducted on the Atalante lower-body exoskeleton with different payloads. Moreover, we conducted a comparative analysis against the model predictive control (MPC) framework based on the reduced-order linear inverted pendulum (LIP) model. Through this comparison, the paper demonstrates that DDPC enables robust bipedal walking at various velocities while accounting for model uncertainties and unknown perturbations.
In the rapidly evolving landscape of artificial intelligence, multimodal learning systems (MMLS) have gained traction for their ability to process and integrate information from diverse modality inputs. Their expanding use in vital sectors such as healthcare has made safety assurance a critical concern. However, the absence of systematic research into their safety is a significant barrier to progress in this field. To bridge the gap, we present the first taxonomy that systematically categorizes and assesses MMLS safety. This taxonomy is structured around four fundamental pillars that are critical to ensuring the safety of MMLS: robustness, alignment, monitoring, and controllability. Leveraging this taxonomy, we review existing methodologies, benchmarks, and the current state of research, while also pinpointing the principal limitations and gaps in knowledge. Finally, we discuss unique challenges in MMLS safety. In illuminating these challenges, we aim to pave the way for future research, proposing potential directions that could lead to significant advancements in the safety protocols of MMLS.
Data-driven optimization models have the potential to significantly improve hospital capacity management, particularly during demand surges, when effective allocation of capacity is most critical and challenging. However, integrating models into existing processes in a way that provides value requires recognizing that hospital administrators are ultimately responsible for making capacity management decisions, and carefully building trustworthy and accessible tools for them. In this study, we develop an interactive, user-friendly, electronic dashboard for informing hospital capacity management decisions during surge periods. The dashboard integrates real-time hospital data, predictive analytics, and optimization models. It allows hospital administrators to interactively customize parameters, enabling them to explore a range of scenarios, and provides real-time updates on recommended optimal decisions. The dashboard was created through a participatory design process, involving hospital administrators in the development team to ensure practical utility, trustworthiness, transparency, explainability, and usability. We successfully deployed our dashboard within the Johns Hopkins Health System during the height of the COVID-19 pandemic, addressing the increased need for tools to inform hospital capacity management. It was used on a daily basis, with results regularly communicated to hospital leadership. This study demonstrates the practical application of a prospective, data-driven, interactive decision-support tool for hospital system capacity management.
We propose a data-driven control method for systems with aleatoric uncertainty, for example, robot fleets with variations between agents. Our method leverages shared trajectory data to increase the robustness of the designed controller and thus facilitate transfer to new variations without the need for prior parameter and uncertainty estimations. In contrast to existing work on experience transfer for performance, our approach focuses on robustness and uses data collected from multiple realizations to guarantee generalization to unseen ones. Our method is based on scenario optimization combined with recent formulations for direct data-driven control. We derive lower bounds on the amount of data required to achieve quadratic stability for probabilistic systems with aleatoric uncertainty and demonstrate the benefits of our data-driven method through a numerical example. We find that the learned controllers generalize well to high variations in the dynamics even when based on only a few short open-loop trajectories. Robust experience transfer enables the design of safe and robust controllers that work out of the box without any additional learning during deployment.
Instruction-tuned LLMs can respond to explicit queries formulated as prompts, which greatly facilitates interaction with human users. However, prompt-based approaches might not always be able to tap into the wealth of implicit knowledge acquired by LLMs during pre-training. This paper presents a comprehensive study of ways to evaluate semantic plausibility in LLMs. We compare base and instruction-tuned LLM performance on an English sentence plausibility task via (a) explicit prompting and (b) implicit estimation via direct readout of the probabilities models assign to strings. Experiment 1 shows that, across model architectures and plausibility datasets, (i) log likelihood ($\textit{LL}$) scores are the most reliable indicator of sentence plausibility, with zero-shot prompting yielding inconsistent and typically poor results; (ii) $\textit{LL}$-based performance is still inferior to human performance; (iii) instruction-tuned models have worse $\textit{LL}$-based performance than base models. In Experiment 2, we show that $\textit{LL}$ scores across models are modulated by context in the expected way, showing high performance on three metrics of context-sensitive plausibility and providing a direct match to explicit human plausibility judgments. Overall, $\textit{LL}$ estimates remain a more reliable measure of plausibility in LLMs than direct prompting.
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code is made available at //github.com/facebookresearch/SlowFast
Approaches based on deep neural networks have achieved striking performance when testing data and training data share similar distribution, but can significantly fail otherwise. Therefore, eliminating the impact of distribution shifts between training and testing data is crucial for building performance-promising deep models. Conventional methods assume either the known heterogeneity of training data (e.g. domain labels) or the approximately equal capacities of different domains. In this paper, we consider a more challenging case where neither of the above assumptions holds. We propose to address this problem by removing the dependencies between features via learning weights for training samples, which helps deep models get rid of spurious correlations and, in turn, concentrate more on the true connection between discriminative features and labels. Extensive experiments clearly demonstrate the effectiveness of our method on multiple distribution generalization benchmarks compared with state-of-the-art counterparts. Through extensive experiments on distribution generalization benchmarks including PACS, VLCS, MNIST-M, and NICO, we show the effectiveness of our method compared with state-of-the-art counterparts.
Representation learning on a knowledge graph (KG) is to embed entities and relations of a KG into low-dimensional continuous vector spaces. Early KG embedding methods only pay attention to structured information encoded in triples, which would cause limited performance due to the structure sparseness of KGs. Some recent attempts consider paths information to expand the structure of KGs but lack explainability in the process of obtaining the path representations. In this paper, we propose a novel Rule and Path-based Joint Embedding (RPJE) scheme, which takes full advantage of the explainability and accuracy of logic rules, the generalization of KG embedding as well as the supplementary semantic structure of paths. Specifically, logic rules of different lengths (the number of relations in rule body) in the form of Horn clauses are first mined from the KG and elaborately encoded for representation learning. Then, the rules of length 2 are applied to compose paths accurately while the rules of length 1 are explicitly employed to create semantic associations among relations and constrain relation embeddings. Besides, the confidence level of each rule is also considered in optimization to guarantee the availability of applying the rule to representation learning. Extensive experimental results illustrate that RPJE outperforms other state-of-the-art baselines on KG completion task, which also demonstrate the superiority of utilizing logic rules as well as paths for improving the accuracy and explainability of representation learning.
Object detection typically assumes that training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch will lead to a significant performance drop. In this work, we aim to improve the cross-domain robustness of object detection. We tackle the domain shift on two levels: 1) the image-level shift, such as image style, illumination, etc, and 2) the instance-level shift, such as object appearance, size, etc. We build our approach based on the recent state-of-the-art Faster R-CNN model, and design two domain adaptation components, on image level and instance level, to reduce the domain discrepancy. The two domain adaptation components are based on H-divergence theory, and are implemented by learning a domain classifier in adversarial training manner. The domain classifiers on different levels are further reinforced with a consistency regularization to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We evaluate our newly proposed approach using multiple datasets including Cityscapes, KITTI, SIM10K, etc. The results demonstrate the effectiveness of our proposed approach for robust object detection in various domain shift scenarios.