The widespread adoption of voice-activated systems has modified routine human-machine interaction but has also introduced new vulnerabilities. This paper investigates the susceptibility of automatic speech recognition (ASR) algorithms in these systems to interference from near-ultrasonic noise. Building upon prior research that demonstrated the ability of near-ultrasonic frequencies (16 kHz - 22 kHz) to exploit the inherent properties of microelectromechanical systems (MEMS) microphones, our study explores alternative privacy enforcement means using this interference phenomenon. We expose a critical vulnerability in the most common microphones used in modern voice-activated devices, which inadvertently demodulate near-ultrasonic frequencies into the audible spectrum, disrupting the ASR process. Through a systematic analysis of the impact of near-ultrasonic noise on various ASR systems, we demonstrate that this vulnerability is consistent across different devices and under varying conditions, such as broadcast distance and specific phoneme structures. Our findings highlight the need to develop robust countermeasures to protect voice-activated systems from malicious exploitation of this vulnerability. Furthermore, we explore the potential applications of this phenomenon in enhancing privacy by disrupting unauthorized audio recording or eavesdropping. This research underscores the importance of a comprehensive approach to securing voice-activated systems, combining technological innovation, responsible development practices, and informed policy decisions to ensure the privacy and security of users in an increasingly connected world.
Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups, and significantly restricting their utility in the wild as well as in terms of embodied AI applications. In this paper, we propose $\textbf{GCD}$, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to, given a video of any scene, generate a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. Our model does not require depth as input, and does not explicitly model 3D scene geometry, instead performing end-to-end video-to-video translation in order to achieve its goal efficiently. Despite being trained on synthetic multi-view video data only, zero-shot real-world generalization experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
Large language models (LLMs) have been shown to face hallucination issues due to the data they trained on often containing human bias; whether this is reflected in the decision-making process of LLM agents remains under-explored. As LLM Agents are increasingly employed in intricate social environments, a pressing and natural question emerges: Can LLM Agents leverage hallucinations to mirror human cognitive biases, thus exhibiting irrational social intelligence? In this paper, we probe the irrational behavior among contemporary LLM agents by melding practical social science experiments with theoretical insights. Specifically, We propose CogMir, an open-ended Multi-LLM Agents framework that utilizes hallucination properties to assess and enhance LLM Agents' social intelligence through cognitive biases. Experimental results on CogMir subsets show that LLM Agents and humans exhibit high consistency in irrational and prosocial decision-making under uncertain conditions, underscoring the prosociality of LLM Agents as social entities, and highlighting the significance of hallucination properties. Additionally, CogMir framework demonstrates its potential as a valuable platform for encouraging more research into the social intelligence of LLM Agents.
The rapid development of collaborative robotics has provided a new possibility of helping the elderly who has difficulties in daily life, allowing robots to operate according to specific intentions. However, efficient human-robot cooperation requires natural, accurate and reliable intention recognition in shared environments. The current paramount challenge for this is reducing the uncertainty of multimodal fused intention to be recognized and reasoning adaptively a more reliable result despite current interactive condition. In this work we propose a novel learning-based multimodal fusion framework Batch Multimodal Confidence Learning for Opinion Pool (BMCLOP). Our approach combines Bayesian multimodal fusion method and batch confidence learning algorithm to improve accuracy, uncertainty reduction and success rate given the interactive condition. In particular, the generic and practical multimodal intention recognition framework can be easily extended further. Our desired assistive scenarios consider three modalities gestures, speech and gaze, all of which produce categorical distributions over all the finite intentions. The proposed method is validated with a six-DoF robot through extensive experiments and exhibits high performance compared to baselines.
The drive to develop artificial neural networks that efficiently utilize resources has generated significant interest in bio-inspired Spiking Neural Networks (SNNs). These networks are particularly attractive due to their potential in applications requiring low power and memory. This potential is further enhanced by the ability to perform online local learning, enabling them to adapt to dynamic environments. This requires the model to be adaptive in a self-supervised manner. While self-supervised learning has seen great success in many deep learning domains, its application for online local learning in multi-layer SNNs remains underexplored. In this paper, we introduce the "EchoSpike Predictive Plasticity" (ESPP) learning rule, a pioneering online local learning rule designed to leverage hierarchical temporal dynamics in SNNs through predictive and contrastive coding. We validate the effectiveness of this approach using benchmark datasets, demonstrating that it performs on par with current state-of-the-art supervised learning rules. The temporal and spatial locality of ESPP makes it particularly well-suited for low-cost neuromorphic processors, representing a significant advancement in developing biologically plausible self-supervised learning models for neuromorphic computing at the edge.
Federated self-supervised learning (FSSL) has recently emerged as a promising paradigm that enables the exploitation of clients' vast amounts of unlabeled data while preserving data privacy. While FSSL offers advantages, its susceptibility to backdoor attacks, a concern identified in traditional federated supervised learning (FSL), has not been investigated. To fill the research gap, we undertake a comprehensive investigation into a backdoor attack paradigm, where unscrupulous clients conspire to manipulate the global model, revealing the vulnerability of FSSL to such attacks. In FSL, backdoor attacks typically build a direct association between the backdoor trigger and the target label. In contrast, in FSSL, backdoor attacks aim to alter the global model's representation for images containing the attacker's specified trigger pattern in favor of the attacker's intended target class, which is less straightforward. In this sense, we demonstrate that existing defenses are insufficient to mitigate the investigated backdoor attacks in FSSL, thus finding an effective defense mechanism is urgent. To tackle this issue, we dive into the fundamental mechanism of backdoor attacks on FSSL, proposing the Embedding Inspector (EmInspector) that detects malicious clients by inspecting the embedding space of local models. In particular, EmInspector assesses the similarity of embeddings from different local models using a small set of inspection images (e.g., ten images of CIFAR100) without specific requirements on sample distribution or labels. We discover that embeddings from backdoored models tend to cluster together in the embedding space for a given inspection image. Evaluation results show that EmInspector can effectively mitigate backdoor attacks on FSSL across various adversary settings. Our code is avaliable at //github.com/ShuchiWu/EmInspector.
Optimizing the deployment of Large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur - a large-scale, high-fidelity, easily-extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, in contrast to a deployment-based exploration which would require 42K GPU hours - costing ~218K dollars. Source code for Vidur is available at //github.com/microsoft/vidur.
Large language models (LLMs) such as OpenAI's ChatGPT and Google's Gemini have demonstrated unprecedented capabilities of autoregressive AI models across multiple tasks triggering disruptive technology innovations around the world. However, as models continue to grow the cost to serve these models also continues to grow threatening the democratization of LLMs. To address this issue, we propose Chiplet Cloud, a chiplet-based ASIC LLM-supercomputer architecture whose goal is to optimize the total cost of ownership (TCO) per generated token. This architecture is a highly parameterizable ASIC and server-level architecture leveraging thousands of replicated accelerator modules collaborating to scale-up the performance of LLMs at cloud-scale. To determine specific parameterizations of the Chiplet Cloud architecture, we implemented a two-phase hardware-software co-design methodology that can search the massive design space and fine tune the architecture across a collection of LLMs based on an accurate inference simulation. A common bottleneck for LLMs is the memory access performance therefore we introduce CC-MEM, a scalable on-chip memory system for Chiplet Cloud architectures. Using the CC-MEM, Chiplet Clouds can be built using only SRAMs for design points where the power and performance of memory access is critical. The CC-MEM also includes a compression decoder module to add support for sparse models without impacting the compute units using a Store-as-Compressed, Load-as-Dense mechanism. We evaluate Chiplet Cloud architectures across eight popular LLMs. Using fine tuned Chiplet Cloud servers we are able to achieve $97\times$ and $18\times$ improvement in TCO/Token over rented GPU and TPU clouds, or a $8.3\times$ and $3.7\times$ improvement over fabricated GPU and TPU clouds respectively. Chiplet Cloud can also support $1.7\times$ larger models with a sparsity of 60\%.
The past decade has witnessed a plethora of works that leverage the power of visualization (VIS) to interpret machine learning (ML) models. The corresponding research topic, VIS4ML, keeps growing at a fast pace. To better organize the enormous works and shed light on the developing trend of VIS4ML, we provide a systematic review of these works through this survey. Since data quality greatly impacts the performance of ML models, our survey focuses specifically on summarizing VIS4ML works from the data perspective. First, we categorize the common data handled by ML models into five types, explain the unique features of each type, and highlight the corresponding ML models that are good at learning from them. Second, from the large number of VIS4ML works, we tease out six tasks that operate on these types of data (i.e., data-centric tasks) at different stages of the ML pipeline to understand, diagnose, and refine ML models. Lastly, by studying the distribution of 143 surveyed papers across the five data types, six data-centric tasks, and their intersections, we analyze the prospective research directions and envision future research trends.
The emergence of large language models (LLMs) has substantially influenced natural language processing, demonstrating exceptional results across various tasks. In this study, we employ ``Introspective Tips" to facilitate LLMs in self-optimizing their decision-making. By introspectively examining trajectories, LLM refines its policy by generating succinct and valuable tips. Our method enhances the agent's performance in both few-shot and zero-shot learning situations by considering three essential scenarios: learning from the agent's past experiences, integrating expert demonstrations, and generalizing across diverse games. Importantly, we accomplish these improvements without fine-tuning the LLM parameters; rather, we adjust the prompt to generalize insights from the three aforementioned situations. Our framework not only supports but also emphasizes the advantage of employing LLM in in-contxt decision-making. Experiments involving over 100 games in TextWorld illustrate the superior performance of our approach.
Convolutional Neural Networks (CNNs) have gained significant traction in the field of machine learning, particularly due to their high accuracy in visual recognition. Recent works have pushed the performance of GPU implementations of CNNs to significantly improve their classification and training times. With these improvements, many frameworks have become available for implementing CNNs on both CPUs and GPUs, with no support for FPGA implementations. In this work we present a modified version of the popular CNN framework Caffe, with FPGA support. This allows for classification using CNN models and specialized FPGA implementations with the flexibility of reprogramming the device when necessary, seamless memory transactions between host and device, simple-to-use test benches, and the ability to create pipelined layer implementations. To validate the framework, we use the Xilinx SDAccel environment to implement an FPGA-based Winograd convolution engine and show that the FPGA layer can be used alongside other layers running on a host processor to run several popular CNNs (AlexNet, GoogleNet, VGG A, Overfeat). The results show that our framework achieves 50 GFLOPS across 3x3 convolutions in the benchmarks. This is achieved within a practical framework, which will aid in future development of FPGA-based CNNs.