In dialogue, the addressee may initially misunderstand the speaker and respond erroneously, often prompting the speaker to correct the misunderstanding in the next turn with a Third Position Repair (TPR). The ability to process and respond appropriately to such repair sequences is thus crucial in conversational AI systems. In this paper, we first collect, analyse, and publicly release BlockWorld-Repairs: a dataset of multi-modal TPR sequences in an instruction-following manipulation task that is, by design, rife with referential ambiguity. We employ this dataset to evaluate several state-of-the-art Vision and Language Models (VLM) across multiple settings, focusing on their capability to process and accurately respond to TPRs and thus recover from miscommunication. We find that, compared to humans, all models significantly underperform in this task. We then show that VLMs can benefit from specialised losses targeting relevant tokens during fine-tuning, achieving better performance and generalising better to new scenarios. Our results suggest that these models are not yet ready to be deployed in multi-modal collaborative settings where repairs are common, and highlight the need to design training regimes and objectives that facilitate learning from interaction. Our code and data are available at www.github.com/JChiyah/blockworld-repairs
Modern software for propositional satisfiability problems gives a powerful automated reasoning toolkit, capable of outputting not only a satisfiable/unsatisfiable signal but also a justification of unsatisfiability in the form of resolution proof (or a more expressive proof), which is commonly used for verification purposes. Empirically, modern SAT solvers produce relatively short proofs, however, there are no inherent guarantees that these proofs cannot be significantly reduced. This paper proposes a novel branch-and-bound algorithm for finding the shortest resolution proofs; to this end, we introduce a layer list representation of proofs that groups clauses by their level of indirection. As we show, this representation breaks all permutational symmetries, thereby improving upon the state-of-the-art symmetry-breaking and informing the design of a novel workflow for proof minimization. In addition to that, we design pruning procedures that reason on proof length lower bound, clause subsumption, and dominance. Our experiments suggest that the proofs from state-of-the-art solvers could be shortened by 30-60% on the instances from SAT Competition 2002 and by 25-50% on small synthetic formulas. When treated as an algorithm for finding the shortest proof, our approach solves twice as many instances as the previous work based on SAT solving and reduces the time to optimality by orders of magnitude for the instances solved by both approaches.
Despite ongoing efforts to defend neural classifiers from adversarial attacks, they remain vulnerable, especially to unseen attacks. In contrast, humans are difficult to be cheated by subtle manipulations, since we make judgments only based on essential factors. Inspired by this observation, we attempt to model label generation with essential label-causative factors and incorporate label-non-causative factors to assist data generation. For an adversarial example, we aim to discriminate the perturbations as non-causative factors and make predictions only based on the label-causative factors. Concretely, we propose a casual diffusion model (CausalDiff) that adapts diffusion models for conditional data generation and disentangles the two types of casual factors by learning towards a novel casual information bottleneck objective. Empirically, CausalDiff has significantly outperformed state-of-the-art defense methods on various unseen attacks, achieving an average robustness of 86.39% (+4.01%) on CIFAR-10, 56.25% (+3.13%) on CIFAR-100, and 82.62% (+4.93%) on GTSRB (German Traffic Sign Recognition Benchmark).
Recently, there has been a significant upsurge of interest in leveraging large language models (LLMs) to assist scientific discovery. However, most LLMs only focus on general science, while they lack domain-specific knowledge, such as chemical molecules and amino acid sequences. To bridge these gaps, we introduce SciDFM, a mixture-of-experts LLM, which is trained from scratch and is able to conduct college-level scientific reasoning and understand molecules and amino acid sequences. We collect a large-scale training corpus containing numerous scientific papers and books from different disciplines as well as data from domain-specific databases. We further fine-tune the pre-trained model on lots of instruction data to improve performances on downstream benchmarks. From experiment results, we show that SciDFM achieves strong performance on general scientific benchmarks such as SciEval and SciQ, and it reaches a SOTA performance on domain-specific benchmarks among models of similar size. We further analyze the expert layers and show that the results of expert selection vary with data from different disciplines. To benefit the broader research community, we open-source SciDFM at //huggingface.co/OpenDFM/SciDFM-MoE-A5.6B-v1.0.
Recent progress in semantic scene understanding has primarily been enabled by the availability of semantically annotated bi-modal (camera and LiDAR) datasets in urban environments. However, such annotated datasets are also needed for natural, unstructured environments to enable semantic perception for applications, including conservation, search and rescue, environment monitoring, and agricultural automation. Therefore, we introduce $WildScenes$, a bi-modal benchmark dataset consisting of multiple large-scale, sequential traversals in natural environments, including semantic annotations in high-resolution 2D images and dense 3D LiDAR point clouds, and accurate 6-DoF pose information. The data is (1) trajectory-centric with accurate localization and globally aligned point clouds, (2) calibrated and synchronized to support bi-modal training and inference, and (3) containing different natural environments over 6 months to support research on domain adaptation. Our 3D semantic labels are obtained via an efficient, automated process that transfers the human-annotated 2D labels from multiple views into 3D point cloud sequences, thus circumventing the need for expensive and time-consuming human annotation in 3D. We introduce benchmarks on 2D and 3D semantic segmentation and evaluate a variety of recent deep-learning techniques to demonstrate the challenges in semantic segmentation in natural environments. We propose train-val-test splits for standard benchmarks as well as domain adaptation benchmarks and utilize an automated split generation technique to ensure the balance of class label distributions. The $WildScenes$ benchmark webpage is //csiro-robotics.github.io/WildScenes, and the data is publicly available at //data.csiro.au/collection/csiro:61541 .
As conversational AI systems increasingly permeate the socio-emotional realms of human life, they bring both benefits and risks to individuals and society. Despite extensive research on detecting and categorizing harms in AI systems, less is known about the harms that arise from social interactions with AI chatbots. Through a mixed-methods analysis of 35,390 conversation excerpts shared on r/replika, an online community for users of the AI companion Replika, we identified six categories of harmful behaviors exhibited by the chatbot: relational transgression, verbal abuse and hate, self-inflicted harm, harassment and violence, mis/disinformation, and privacy violations. The AI contributes to these harms through four distinct roles: perpetrator, instigator, facilitator, and enabler. Our findings highlight the relational harms of AI chatbots and the danger of algorithmic compliance, enhancing the understanding of AI harms in socio-emotional interactions. We also provide suggestions for designing ethical and responsible AI systems that prioritize user safety and well-being.
Cardinality sketches are compact data structures for representing sets or vectors, enabling efficient approximation of their cardinality (or the number of nonzero entries). These sketches are space-efficient, typically requiring only logarithmic storage relative to input size, and support incremental updates, allowing for dynamic modifications. A critical property of many cardinality sketches is composability, meaning that the sketch of a union of sets can be computed from individual sketches. Existing designs typically provide strong statistical guarantees, accurately answering an exponential number of queries in terms of sketch size $k$. However, these guarantees degrade to quadratic in $k$ when queries are adaptive and may depend on previous responses. Prior works on statistical queries (Steinke and Ullman, 2015) and specific MinHash cardinality sketches (Ahmadian and Cohen, 2024) established that the quadratic bound on the number of adaptive queries is, in fact, unavoidable. In this work, we develop a unified framework that generalizes these results across broad classes of cardinality sketches. We show that any union-composable sketching map is vulnerable to attack with $\tilde{O}(k^4)$ queries and, if the sketching map is also monotone (as for MinHash and statistical queries), we obtain a tight bound of $\tilde{O}(k^2)$ queries. Additionally, we demonstrate that linear sketches over the reals $\mathbb{R}$ and fields $\mathbb{F}_p$ can be attacked using $\tilde{O}(k^2)$ adaptive queries, which is optimal and strengthens some of the recent results by Gribelyuk et al. (2024), which required a larger polynomial number of rounds for such matrices.
By leveraging the representation power of deep neural networks, neural upper confidence bound (UCB) algorithms have shown success in contextual bandits. To further balance the exploration and exploitation, we propose Neural-$\sigma^2$-LinearUCB, a variance-aware algorithm that utilizes $\sigma^2_t$, i.e., an upper bound of the reward noise variance at round $t$, to enhance the uncertainty quantification quality of the UCB, resulting in a regret performance improvement. We provide an oracle version for our algorithm characterized by an oracle variance upper bound $\sigma^2_t$ and a practical version with a novel estimation for this variance bound. Theoretically, we provide rigorous regret analysis for both versions and prove that our oracle algorithm achieves a better regret guarantee than other neural-UCB algorithms in the neural contextual bandits setting. Empirically, our practical method enjoys a similar computational efficiency, while outperforming state-of-the-art techniques by having a better calibration and lower regret across multiple standard settings, including on the synthetic, UCI, MNIST, and CIFAR-10 datasets.
Nowadays, neural networks are commonly used to solve various problems. Unfortunately, despite their effectiveness, they are often perceived as black boxes capable of providing answers without explaining their decisions, which raises numerous ethical and legal concerns. Fortunately, the field of explainability helps users understand these results. This aspect of machine learning allows users to grasp the decision-making process of a model and verify the relevance of its outcomes. In this article, we focus on the learning process carried out by a ``time distributed`` convRNN, which performs anomaly detection from video data.
Human intelligence thrives on the concept of cognitive synergy, where collaboration and information integration among different cognitive processes yield superior outcomes compared to individual cognitive processes in isolation. Although Large Language Models (LLMs) have demonstrated promising performance as general task-solving agents, they still struggle with tasks that require intensive domain knowledge and complex reasoning. In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist refers to an intelligent agent that collaborates with multiple minds, combining their individual strengths and knowledge, to enhance problem-solving and overall performance in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. We have discovered that assigning multiple, fine-grained personas in LLMs elicits better problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, SPP effectively elicits internal knowledge acquisition abilities, reduces hallucination, and maintains strong reasoning capabilities. Code, data, and prompts can be found at: //github.com/MikeWangWZHL/Solo-Performance-Prompting.git.
Image segmentation is still an open problem especially when intensities of the interested objects are overlapped due to the presence of intensity inhomogeneity (also known as bias field). To segment images with intensity inhomogeneities, a bias correction embedded level set model is proposed where Inhomogeneities are Estimated by Orthogonal Primary Functions (IEOPF). In the proposed model, the smoothly varying bias is estimated by a linear combination of a given set of orthogonal primary functions. An inhomogeneous intensity clustering energy is then defined and membership functions of the clusters described by the level set function are introduced to rewrite the energy as a data term of the proposed model. Similar to popular level set methods, a regularization term and an arc length term are also included to regularize and smooth the level set function, respectively. The proposed model is then extended to multichannel and multiphase patterns to segment colourful images and images with multiple objects, respectively. It has been extensively tested on both synthetic and real images that are widely used in the literature and public BrainWeb and IBSR datasets. Experimental results and comparison with state-of-the-art methods demonstrate that advantages of the proposed model in terms of bias correction and segmentation accuracy.