The impressive capabilities of Large Language Models (LLMs) have led to various efforts to enable robots to be controlled through natural language instructions, opening exciting possibilities for human-robot interaction The goal is for the motor-control task to be performed accurately, efficiently and safely while also enjoying the flexibility imparted by LLMs to specify and adjust the task through natural language. In this work, we demonstrate how a careful layering of an LLM in combination with a Model Predictive Control (MPC) formulation allows for accurate and flexible robotic control via natural language while taking into consideration safety constraints. In particular, we rely on the LLM to effectively frame constraints and objective functions as mathematical expressions, which are later used in the motor-control module via MPC. The transparency of the optimization formulation allows for interpretability of the task and enables adjustments through human feedback. We demonstrate the validity of our method through extensive experiments on long-horizon reasoning, contact-rich, and multi-object interaction tasks. Our evaluations show that NARRATE outperforms current existing methods on these benchmarks and effectively transfers to the real world on two different embodiments. Videos, Code and Prompts at narrate-mpc.github.io
Virtual Private Cloud (VPC) is the main network abstraction technology used in public cloud systems. VPCs are composed of a set of network services that permit the definition of complex network reachability properties among internal and external cloud entities such as tenants' VMs or some generic internet nodes. Although hiding the underlying complexity through a comprehensible abstraction layer, manually enforcing particular reachability intents in VPC networks is still notably error-prone and complex. In this paper, we propose AutoNet, a new model for assisting cloud tenants in managing reachability-based policies in VPC networks. AutoNet is capable of safely generating incremental VPC configurations while satisfying some metric-based high-level intent defined by the tenants. To achieve this goal, we leverage a MaxSAT-based encoding of the network configuration combined with several optimizations to scale to topologies with thousands of nodes. Our results show that the developed system is capable of achieving a sub-second response time for production VPC deployments while still providing fine-grained control over the generated configurations.
Adapting Large Language Models (LLMs) to new tasks through fine-tuning has been made more efficient by the introduction of Parameter-Efficient Fine-Tuning (PEFT) techniques, such as LoRA. However, these methods often underperform compared to full fine-tuning, particularly in scenarios involving complex datasets. This issue becomes even more pronounced in complex domains, highlighting the need for improved PEFT approaches that can achieve better performance. Through a series of experiments, we have uncovered two critical insights that shed light on the training and parameter inefficiency of LoRA. Building on these insights, we have developed HydraLoRA, a LoRA framework with an asymmetric structure that eliminates the need for domain expertise. Our experiments demonstrate that HydraLoRA outperforms other PEFT approaches, even those that rely on domain knowledge during the training and inference phases. \href{//github.com/Clin0212/HydraLoRA}{Code}.
The integration of Internet of Things (IoT) applications in our daily lives has led to a surge in data traffic, posing significant security challenges. IoT applications using cloud and edge computing are at higher risk of cyberattacks because of the expanded attack surface from distributed edge and cloud services, the vulnerability of IoT devices, and challenges in managing security across interconnected systems leading to oversights. This led to the rise of ML-based solutions for intrusion detection systems (IDSs), which have proven effective in enhancing network security and defending against diverse threats. However, ML-based IDS in IoT systems encounters challenges, particularly from noisy, redundant, and irrelevant features in varied IoT datasets, potentially impacting its performance. Therefore, reducing such features becomes crucial to enhance system performance and minimize computational costs. This paper focuses on improving the effectiveness of ML-based IDS at the edge level by introducing a novel method to find a balanced trade-off between cost and accuracy through the creation of informative features in a two-tier edge-user IoT environment. A hybrid Binary Quantum-inspired Artificial Bee Colony and Genetic Programming algorithm is utilized for this purpose. Three IoT intrusion detection datasets, namely NSL-KDD, UNSW-NB15, and BoT-IoT, are used for the evaluation of the proposed approach.
The adoption of artificial intelligence (AI) across industries has led to the widespread use of complex black-box models and interpretation tools for decision making. This paper proposes an adversarial framework to uncover the vulnerability of permutation-based interpretation methods for machine learning tasks, with a particular focus on partial dependence (PD) plots. This adversarial framework modifies the original black box model to manipulate its predictions for instances in the extrapolation domain. As a result, it produces deceptive PD plots that can conceal discriminatory behaviors while preserving most of the original model's predictions. This framework can produce multiple fooled PD plots via a single model. By using real-world datasets including an auto insurance claims dataset and COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset, our results show that it is possible to intentionally hide the discriminatory behavior of a predictor and make the black-box model appear neutral through interpretation tools like PD plots while retaining almost all the predictions of the original black-box model. Managerial insights for regulators and practitioners are provided based on the findings.
Fairness is steadily becoming a crucial requirement of Machine Learning (ML) systems. A particularly important notion is subgroup fairness, i.e., fairness in subgroups of individuals that are defined by more than one attributes. Identifying bias in subgroups can become both computationally challenging, as well as problematic with respect to comprehensibility and intuitiveness of the finding to end users. In this work we focus on the latter aspects; we propose an explainability method tailored to identifying potential bias in subgroups and visualizing the findings in a user friendly manner to end users. In particular, we extend the ALE plots explainability method, proposing FALE (Fairness aware Accumulated Local Effects) plots, a method for measuring the change in fairness for an affected population corresponding to different values of a feature (attribute). We envision FALE to function as an efficient, user friendly, comprehensible and reliable first-stage tool for identifying subgroups with potential bias issues.
The generative AI technology offers an increasing variety of tools for generating entirely synthetic images that are increasingly indistinguishable from real ones. Unlike methods that alter portions of an image, the creation of completely synthetic images presents a unique challenge and several Synthetic Image Detection (SID) methods have recently appeared to tackle it. Yet, there is often a large gap between experimental results on benchmark datasets and the performance of methods in the wild. To better address the evaluation needs of SID and help close this gap, this paper introduces a benchmarking framework that integrates several state-of-the-art SID models. Our selection of integrated models was based on the utilization of varied input features, and different network architectures, aiming to encompass a broad spectrum of techniques. The framework leverages recent datasets with a diverse set of generative models, high level of photo-realism and resolution, reflecting the rapid improvements in image synthesis technology. Additionally, the framework enables the study of how image transformations, common in assets shared online, such as JPEG compression, affect detection performance. SIDBench is available on //github.com/mever-team/sidbench and is designed in a modular manner to enable easy inclusion of new datasets and SID models.
Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on specific task (e.g time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 20 models, revealed that while the closed-source GPT-4(Vision) and Gemini 1.5 outperform others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen with an increase in the number of images. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.
Deep neural network (DNN) video analytics is crucial for autonomous systems such as self-driving vehicles, unmanned aerial vehicles (UAVs), and security robots. However, real-world deployment faces challenges due to their limited computational resources and battery power. To tackle these challenges, continuous learning exploits a lightweight "student" model at deployment (inference), leverages a larger "teacher" model for labeling sampled data (labeling), and continuously retrains the student model to adapt to changing scenarios (retraining). This paper highlights the limitations in state-of-the-art continuous learning systems: (1) they focus on computations for retraining, while overlooking the compute needs for inference and labeling, (2) they rely on power-hungry GPUs, unsuitable for battery-operated autonomous systems, and (3) they are located on a remote centralized server, intended for multi-tenant scenarios, again unsuitable for autonomous systems due to privacy, network availability, and latency concerns. We propose a hardware-algorithm co-designed solution for continuous learning, DaCapo, that enables autonomous systems to perform concurrent executions of inference, labeling, and training in a performant and energy-efficient manner. DaCapo comprises (1) a spatially-partitionable and precision-flexible accelerator enabling parallel execution of kernels on sub-accelerators at their respective precisions, and (2) a spatiotemporal resource allocation algorithm that strategically navigates the resource-accuracy tradeoff space, facilitating optimal decisions for resource allocation to achieve maximal accuracy. Our evaluation shows that DaCapo achieves 6.5% and 5.5% higher accuracy than a state-of-the-art GPU-based continuous learning systems, Ekya and EOMU, respectively, while consuming 254x less power.
The introduction of ChatGPT has led to a significant increase in the utilization of Large Language Models (LLMs) for addressing downstream tasks. There's an increasing focus on cost-efficient training and deployment within this context. Low-cost training and deployment of LLMs represent the future development trend. This paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend. The discussion on training includes various aspects, including data preprocessing, training architecture, pre-training tasks, parallel training, and relevant content related to model fine-tuning. On the inference side, the paper covers topics such as model compression, parallel computation, memory scheduling, and structural optimization. It also explores LLMs' utilization and provides insights into their future development.
Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, how to properly describe both image global structures and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing image output. During the above multi-scale representation learning stage, additional input conditions like text, scene graph, or image layout can be further exploited. Thus, Frido can be also applied for conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditioned and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, scene-graph-to-image, to label-to-image. More specifically, we achieved state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at //github.com/davidhalladay/Frido.