Polynomial multiplication is one of the fundamental operations in many applications, such as fully homomorphic encryption (FHE). However, the computational inefficiency stemming from polynomials with many large-bit coefficients poses a significant challenge for the practical implementation of FHE. The Number Theoretic Transform (NTT) has proven an effective tool in enhancing polynomial multiplication, but a fast and adaptable method for generating NTT accelerators is lacking. In this paper, we introduce HF-NTT, a novel NTT accelerator. HF-NTT efficiently handles polynomials of varying degrees and moduli, allowing for a balance between performance and hardware resources by adjusting the number of Processing Elements (PEs). Meanwhile, we introduce a data movement strategy that eliminates the need for bit-reversal operations, resolves different hazards, and reduces the clock cycles. Furthermore, Our accelerator includes a hardware-friendly modular multiplication design and a configurable PE capable of adapting its data path, resulting in a universal architecture. We synthesized and implemented prototype using Vivado 2022.2, and evaluated it on the Xilinx Virtex-7 FPGA platform. The results demonstrate significant improvements in Area-Time-Product (ATP) and processing speed for different polynomial degrees. In scenarios involving multi-modulus polynomial multiplication, our prototype consistently outperforms other designs in both ATP and latency metrics.
Effective preference tuning is pivotal in aligning chatbot responses with human expectations, enhancing user satisfaction and engagement. Traditional approaches, notably Reinforcement Learning from Human Feedback (RLHF) as employed in advanced models like GPT-4, have demonstrated considerable success in this domain. However, RLHF methods are often computationally intensive and resource-demanding, limiting their scalability and accessibility for broader applications. To address these challenges, this study introduces LoRA-Lite Ensemble (LoRA-LiteE), an innovative framework that combines Supervised Fine-tuning (SFT) with Low-Rank Adaptation (LoRA) and Ensemble Learning techniques to effectively aggregate predictions of lightweight models, which aim to achieve a balance between the performance and computational cost. Utilizing the Chatbot Arena benchmark dataset, we conduct a comprehensive comparative analysis among our LoRA-LiteE model, corresponding base models at different scales, and GPT-4 trained with RLHF. Our empirical results demonstrate that the proposed LoRA-LiteE model achieves comparable performance to un-finetuned GPT-4 and outperforms the single larger-scale models under limited resource constraints. These findings highlight that our LoRA-LiteE provides a feasible and efficient methodology for human preference prediction in chatbot systems, enhancing scalability and accessibility, and thereby broadening the applicability of preference-tuned chatbots in resource-constrained environments.
The neutral atom array has gained prominence in quantum computing for its scalability and operation fidelity. Previous works focus on fixed atom arrays (FAAs) that require extensive SWAP operations for long-range interactions. This work explores a novel architecture reconfigurable atom arrays (RAAs), also known as field programmable qubit arrays (FPQAs), which allows for coherent atom movements during circuit execution under some constraints. Such atom movements, which are unique to this architecture, could reduce the cost of long-range interactions significantly if the atom movements could be scheduled strategically. In this work, we introduce Atomique, a compilation framework designed for qubit mapping, atom movement, and gate scheduling for RAA. Atomique contains a qubit-array mapper to decide the coarse-grained mapping of the qubits to arrays, leveraging MAX k-Cut on a constructed gate frequency graph to minimize SWAP overhead. Subsequently, a qubit-atom mapper determines the fine-grained mapping of qubits to specific atoms in the array and considers load balance to prevent hardware constraint violations. We further propose a router that identifies parallel gates, schedules them simultaneously, and reduces depth. We evaluate Atomique across 20+ diverse benchmarks, including generic circuits (arbitrary, QASMBench, SupermarQ), quantum simulation, and QAOA circuits. Atomique consistently outperforms IBM Superconducting, FAA with long-range gates, and FAA with rectangular and triangular topologies, achieving significant reductions in depth and the number of two-qubit gates.
By augmenting Large Language Models (LLMs) with external tools, their capacity to solve complex problems has been significantly enhanced. However, despite ongoing advancements in the parsing capabilities of LLMs, incorporating all available tools simultaneously in the prompt remains impractical due to the vast number of external tools. Consequently, it is essential to provide LLMs with a precise set of tools tailored to the specific task, considering both quantity and quality. Current tool retrieval methods primarily focus on refining the ranking list of tools and directly packaging a fixed number of top-ranked tools as the tool set. However, these approaches often fail to equip LLMs with the optimal set of tools prior to execution, since the optimal number of tools for different tasks could be different, resulting in inefficiencies such as redundant or unsuitable tools, which impede immediate access to the most relevant tools. This paper addresses the challenge of recommending precise toolsets for LLMs. We introduce the problem of tool recommendation, define its scope, and propose a novel Precision-driven Tool Recommendation (PTR) approach. PTR captures an initial, concise set of tools by leveraging historical tool bundle usage and dynamically adjusts the tool set by performing tool matching, culminating in a multi-view-based tool addition. Additionally, we present a new dataset, RecTools, and a metric, TRACC, designed to evaluate the effectiveness of tool recommendation for LLMs. We further validate our design choices through comprehensive experiments, demonstrating promising accuracy across two open benchmarks and our RecTools dataset.
Various processing-in-memory (PIM) accelerators based on various devices, micro-architectures, and interfaces have been proposed to accelerate deep neural networks (DNNs). How to deploy DNNs onto PIM-based accelerators is the key to explore PIM's high performance and energy efficiency. The scale of DNN models, the diversity of PIM accelerators, and the complexity of deployment are far beyond the human deployment capability. Hence, an automatic deployment methodology is indispensable. In this work, we propose PIMCOMP, an end-to-end DNN compiler tailored for PIM accelerators, achieving efficient deployment of DNN models on PIM hardware. PIMCOMP can adapt to various PIM architectures by using an abstract configurable PIM accelerator template with a set of pseudo-instructions, which is a high-level abstraction of the hardware's fundamental functionalities. Through a generic multi-level optimization framework, PIMCOMP realizes an end-to-end conversion from a high-level DNN description to pseudo-instructions, which can be further converted to specific hardware intrinsics/primitives. The compilation addresses two critical issues in PIM-accelerated inference from a system perspective: resource utilization and dataflow scheduling. PIMCOMP adopts a flexible unfolding format to reshape and partition convolutional layers, adopts a weight-layout guided computation-storage-mapping approach to enhance resource utilization, and balances the system's computation, memory access, and communication characteristics. For dataflow scheduling, we design two scheduling algorithms with different inter-layer pipeline granularities to support varying application scenarios while ensuring high computational parallelism. Experiments demonstrate that PIMCOMP improves throughput, latency, and energy efficiency across various architectures. PIMCOMP is open-sourced at \url{//github.com/sunxt99/PIMCOMP-NN}.
Multi-modal 3D scene understanding has gained considerable attention due to its wide applications in many areas, such as autonomous driving and human-computer interaction. Compared to conventional single-modal 3D understanding, introducing an additional modality not only elevates the richness and precision of scene interpretation but also ensures a more robust and resilient understanding. This becomes especially crucial in varied and challenging environments where solely relying on 3D data might be inadequate. While there has been a surge in the development of multi-modal 3D methods over past three years, especially those integrating multi-camera images (3D+2D) and textual descriptions (3D+language), a comprehensive and in-depth review is notably absent. In this article, we present a systematic survey of recent progress to bridge this gap. We begin by briefly introducing a background that formally defines various 3D multi-modal tasks and summarizes their inherent challenges. After that, we present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations. Furthermore, comparative results of recent approaches on several benchmark datasets, together with insightful analysis, are offered. Finally, we discuss the unresolved issues and provide several potential avenues for future research.
Deep learning has shown great potential for modeling the physical dynamics of complex particle systems such as fluids (in Lagrangian descriptions). Existing approaches, however, require the supervision of consecutive particle properties, including positions and velocities. In this paper, we consider a partially observable scenario known as fluid dynamics grounding, that is, inferring the state transitions and interactions within the fluid particle systems from sequential visual observations of the fluid surface. We propose a differentiable two-stage network named NeuroFluid. Our approach consists of (i) a particle-driven neural renderer, which involves fluid physical properties into the volume rendering function, and (ii) a particle transition model optimized to reduce the differences between the rendered and the observed images. NeuroFluid provides the first solution to unsupervised learning of particle-based fluid dynamics by training these two models jointly. It is shown to reasonably estimate the underlying physics of fluids with different initial shapes, viscosity, and densities. It is a potential alternative approach to understanding complex fluid mechanics, such as turbulence, that are difficult to model using traditional methods of mathematical physics.
Multi-modal fusion is a fundamental task for the perception of an autonomous driving system, which has recently intrigued many researchers. However, achieving a rather good performance is not an easy task due to the noisy raw data, underutilized information, and the misalignment of multi-modal sensors. In this paper, we provide a literature review of the existing multi-modal-based methods for perception tasks in autonomous driving. Generally, we make a detailed analysis including over 50 papers leveraging perception sensors including LiDAR and camera trying to solve object detection and semantic segmentation tasks. Different from traditional fusion methodology for categorizing fusion models, we propose an innovative way that divides them into two major classes, four minor classes by a more reasonable taxonomy in the view of the fusion stage. Moreover, we dive deep into the current fusion methods, focusing on the remaining problems and open-up discussions on the potential research opportunities. In conclusion, what we expect to do in this paper is to present a new taxonomy of multi-modal fusion methods for the autonomous driving perception tasks and provoke thoughts of the fusion-based techniques in the future.
Out-of-distribution (OOD) detection is critical to ensuring the reliability and safety of machine learning systems. For instance, in autonomous driving, we would like the driving system to issue an alert and hand over the control to humans when it detects unusual scenes or objects that it has never seen before and cannot make a safe decision. This problem first emerged in 2017 and since then has received increasing attention from the research community, leading to a plethora of methods developed, ranging from classification-based to density-based to distance-based ones. Meanwhile, several other problems are closely related to OOD detection in terms of motivation and methodology. These include anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). Despite having different definitions and problem settings, these problems often confuse readers and practitioners, and as a result, some existing studies misuse terms. In this survey, we first present a generic framework called generalized OOD detection, which encompasses the five aforementioned problems, i.e., AD, ND, OSR, OOD detection, and OD. Under our framework, these five problems can be seen as special cases or sub-tasks, and are easier to distinguish. Then, we conduct a thorough review of each of the five areas by summarizing their recent technical developments. We conclude this survey with open challenges and potential research directions.
Generative commonsense reasoning which aims to empower machines to generate sentences with the capacity of reasoning over a set of concepts is a critical bottleneck for text generation. Even the state-of-the-art pre-trained language generation models struggle at this task and often produce implausible and anomalous sentences. One reason is that they rarely consider incorporating the knowledge graph which can provide rich relational information among the commonsense concepts. To promote the ability of commonsense reasoning for text generation, we propose a novel knowledge graph augmented pre-trained language generation model KG-BART, which encompasses the complex relations of concepts through the knowledge graph and produces more logical and natural sentences as output. Moreover, KG-BART can leverage the graph attention to aggregate the rich concept semantics that enhances the model generalization on unseen concept sets. Experiments on benchmark CommonGen dataset verify the effectiveness of our proposed approach by comparing with several strong pre-trained language generation models, particularly KG-BART outperforms BART by 5.80, 4.60, in terms of BLEU-3, 4. Moreover, we also show that the generated context by our model can work as background scenarios to benefit downstream commonsense QA tasks.
Exploration-exploitation is a powerful and practical tool in multi-agent learning (MAL), however, its effects are far from understood. To make progress in this direction, we study a smooth analogue of Q-learning. We start by showing that our learning model has strong theoretical justification as an optimal model for studying exploration-exploitation. Specifically, we prove that smooth Q-learning has bounded regret in arbitrary games for a cost model that explicitly captures the balance between game and exploration costs and that it always converges to the set of quantal-response equilibria (QRE), the standard solution concept for games under bounded rationality, in weighted potential games with heterogeneous learning agents. In our main task, we then turn to measure the effect of exploration in collective system performance. We characterize the geometry of the QRE surface in low-dimensional MAL systems and link our findings with catastrophe (bifurcation) theory. In particular, as the exploration hyperparameter evolves over-time, the system undergoes phase transitions where the number and stability of equilibria can change radically given an infinitesimal change to the exploration parameter. Based on this, we provide a formal theoretical treatment of how tuning the exploration parameter can provably lead to equilibrium selection with both positive as well as negative (and potentially unbounded) effects to system performance.