We present an extensive, in-depth analysis of the eye tracking capabilities of the Meta Quest Pro virtual reality headset using a dataset of eye movement recordings collected from 78 participants. In addition to presenting classical signal quality metrics--spatial accuracy, spatial precision and linearity--in ideal settings, we also study the impact of background luminance and headset slippage on device performance. We additionally present a user-centered analysis of eye tracking signal quality, where we highlight the potential differences in user experience as a function of device performance. This work contributes to a growing understanding of eye tracking signal quality in virtual reality headsets, where the performance of applications such as gaze-based interaction, foveated rendering, and social gaze are directly dependent on the quality of eye tracking signal.
Phishing, a prevalent cybercrime tactic for decades, remains a significant threat in today's digital world. By leveraging clever social engineering elements and modern technology, cybercrime targets many individuals, businesses, and organizations to exploit trust and security. These cyber-attackers are often disguised in many trustworthy forms to appear as legitimate sources. By cleverly using psychological elements like urgency, fear, social proof, and other manipulative strategies, phishers can lure individuals into revealing sensitive and personalized information. Building on this pervasive issue within modern technology, this paper aims to analyze the effectiveness of 15 Large Language Models (LLMs) in detecting phishing attempts, specifically focusing on a randomized set of "419 Scam" emails. The objective is to determine which LLMs can accurately detect phishing emails by analyzing a text file containing email metadata based on predefined criteria. The experiment concluded that the following models, ChatGPT 3.5, GPT-3.5-Turbo-Instruct, and ChatGPT, were the most effective in detecting phishing emails.
We introduce XFT, a simple yet powerful training scheme, by simply merging upcycled Mixture-of-Experts (MoE) to unleash the performance limit of instruction-tuned code Large Language Models (LLMs). While vanilla sparse upcycling fails to improve instruction tuning, XFT introduces a shared expert mechanism with a novel routing weight normalization strategy into sparse upcycling, which significantly boosts instruction tuning. After fine-tuning the upcycled MoE model, XFT introduces a learnable model merging mechanism to compile the upcycled MoE model back to a dense model, achieving upcycled MoE-level performance with only dense-model compute. By applying XFT to a 1.3B model, we create a new state-of-the-art tiny code LLM (<3B) with 67.1 and 64.6 pass@1 on HumanEval and HumanEval+ respectively. With the same data and model architecture, XFT improves supervised fine-tuning (SFT) by 13% on HumanEval+, along with consistent improvements from 2% to 13% on MBPP+, MultiPL-E, and DS-1000, demonstrating its generalizability. XFT is fully orthogonal to existing techniques such as Evol-Instruct and OSS-Instruct, opening a new dimension for improving code instruction tuning. Codes are available at //github.com/ise-uiuc/xft .
As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) recently gains increasing attention, with the development of vision-language pre-trainings. To enable generalization of arbitrary classes, existing methods treat class labels as text descriptions, then formulate OVAR as evaluating embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., misspellings and typos, limiting the real-world practicality of vanilla OVAR. To fill the research gap, this paper pioneers to evaluate existing methods by simulating multi-level noises of various types, and reveals their poor robustness. To tackle the noisy OVAR task, we further propose one novel DENOISER framework, covering two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via one decoding process, i.e., propose text candidates, then utilize inter-modal and intra-modal information to vote for the best. At the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thus obtaining more semantics. For optimization, we alternately iterate between generative and discriminative parts for progressive refinements. The denoised text classes help OVAR models classify visual samples more accurately; in return, classified visual samples help better denoising. On three datasets, we carry out extensive experiments to show our superior robustness, and thorough ablations to dissect the effectiveness of each component.
Visual deep learning (VDL) systems have shown significant success in real-world applications like image recognition, object detection, and autonomous driving. To evaluate the reliability of VDL, a mainstream approach is software testing, which requires diverse and controllable mutations over image semantics. The rapid development of multi-modal large language models (MLLMs) has introduced revolutionary image mutation potentials through instruction-driven methods. Users can now freely describe desired mutations and let MLLMs generate the mutated images. However, the quality of MLLM-produced test inputs in VDL testing remains largely unexplored. We present the first study, aiming to assess MLLMs' adequacy from 1) the semantic validity of MLLM mutated images, 2) the alignment of MLLM mutated images with their text instructions (prompts), 3) the faithfulness of how different mutations preserve semantics that are ought to remain unchanged, and 4) the effectiveness of detecting VDL faults. With large-scale human studies and quantitative evaluations, we identify MLLM's promising potentials in expanding the covered semantics of image mutations. Notably, while SoTA MLLMs (e.g., GPT-4V) fail to support or perform worse in editing existing semantics in images (as in traditional mutations like rotation), they generate high-quality test inputs using "semantic-additive" mutations (e.g., "dress a dog with clothes"), which bring extra semantics to images; these were infeasible for past approaches. Hence, we view MLLM-based mutations as a vital complement to traditional mutations, and advocate future VDL testing tasks to combine MLLM-based methods and traditional image mutations for comprehensive and reliable testing.
Over the past decade, deep learning helped solve manipulation problems across all domains of robotics. At the same time, industrial robots continue to be programmed overwhelmingly using traditional program representations and interfaces. This paper undertakes an analysis of this "AI adoption gap" from an industry practitioner's perspective. In response, we propose the BANSAI approach (Bridging the AI Adoption Gap via Neurosymbolic AI). It systematically leverages principles of neurosymbolic AI to establish data-driven, subsymbolic program synthesis and optimization in modern industrial robot programming workflow. BANSAI conceptually unites several lines of prior research and proposes a path toward practical, real-world validation.
Batch Normalization's (BN) unique property of depending on other samples in a batch is known to cause problems in several tasks, including sequence modeling. Yet, BN-related issues are hardly studied for long video understanding, despite the ubiquitous use of BN in CNNs (Convolutional Neural Networks) for feature extraction. Especially in surgical workflow analysis, where the lack of pretrained feature extractors has led to complex, multi-stage training pipelines, limited awareness of BN issues may have hidden the benefits of training CNNs and temporal models end to end. In this paper, we analyze pitfalls of BN in video learning, including issues specific to online tasks such as a 'cheating' effect in anticipation. We observe that BN's properties create major obstacles for end-to-end learning. However, using BN-free backbones, even simple CNN-LSTMs beat the state of the art {\color{\colorrevtwo}on three surgical workflow benchmarks} by utilizing adequate end-to-end training strategies which maximize temporal context. We conclude that awareness of BN's pitfalls is crucial for effective end-to-end learning in surgical tasks. By reproducing results on natural-video datasets, we hope our insights will benefit other areas of video learning as well. Code is available at: \url{//gitlab.com/nct_tso_public/pitfalls_bn}
This manuscript presents a novel Bayesian varying coefficient quantile regression (BVCQR) model designed to assess the longitudinal effects of chemical exposure mixtures on children's neurodevelopment. Recognizing the complexity and high-dimensionality of environmental exposures, the proposed approach addresses critical gaps in existing research by offering a method that can manage the sparsity of data and provide interpretable results. The proposed BVCQR model estimates the effects of mixtures on neurodevelopmental outcomes at specific ages, leveraging a horseshoe prior for sparsity and utilizing a Bayesian method for uncertainty quantification. Our simulations demonstrate the model's robustness and effectiveness in handling high-dimensional data, offering significant improvements over traditional models. The model's application to the Health Outcomes and Measures of the Environment (HOME) Study further illustrates its utility in identifying significant chemical exposures affecting children's growth and development. The findings underscore the potential of BVCQR in environmental health research, providing a sophisticated tool for analyzing the longitudinal impact of complex chemical mixtures, with implications for future studies aimed at understanding and mitigating environmental risks to child health.
With the rapid advancements in Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question of whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2, 000 malicious queries that is also proposed in this paper, we generate 20, 000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8, 000 image-based jailbreak inputs from recent MLLMs jailbreak attacks, our comprehensive dataset includes 28, 000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.
The dominating NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications (eg. sentiment classification, span-prediction based question answering or machine translation). However, it builds upon the assumption that the data distribution is stationary, ie. that the data is sampled from a fixed distribution both at training and test time. This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information. Moreover, it is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime. The first goal of this thesis is to characterize the different forms this shift can take in the context of natural language processing, and propose benchmarks and evaluation metrics to measure its effect on current deep learning architectures. We then proceed to take steps to mitigate the effect of distributional shift on NLP models. To this end, we develop methods based on parametric reformulations of the distributionally robust optimization framework. Empirically, we demonstrate that these approaches yield more robust models as demonstrated on a selection of realistic problems. In the third and final part of this thesis, we explore ways of efficiently adapting existing models to new domains or tasks. Our contribution to this topic takes inspiration from information geometry to derive a new gradient update rule which alleviate catastrophic forgetting issues during adaptation.
Deep neural networks (DNNs) are successful in many computer vision tasks. However, the most accurate DNNs require millions of parameters and operations, making them energy, computation and memory intensive. This impedes the deployment of large DNNs in low-power devices with limited compute resources. Recent research improves DNN models by reducing the memory requirement, energy consumption, and number of operations without significantly decreasing the accuracy. This paper surveys the progress of low-power deep learning and computer vision, specifically in regards to inference, and discusses the methods for compacting and accelerating DNN models. The techniques can be divided into four major categories: (1) parameter quantization and pruning, (2) compressed convolutional filters and matrix factorization, (3) network architecture search, and (4) knowledge distillation. We analyze the accuracy, advantages, disadvantages, and potential solutions to the problems with the techniques in each category. We also discuss new evaluation metrics as a guideline for future research.