Artificial intelligence (AI) systems possess significant potential to drive societal progress. However, their deployment often faces obstacles due to substantial safety concerns. Safe reinforcement learning (SafeRL) emerges as a solution to optimize policies while simultaneously adhering to multiple constraints, thereby addressing the challenge of integrating reinforcement learning in safety-critical scenarios. In this paper, we present an environment suite called Safety-Gymnasium, which encompasses safety-critical tasks in both single and multi-agent scenarios, accepting vector and vision-only input. Additionally, we offer a library of algorithms named Safe Policy Optimization (SafePO), comprising 16 state-of-the-art SafeRL algorithms. This comprehensive library can serve as a validation tool for the research community. By introducing this benchmark, we aim to facilitate the evaluation and comparison of safety performance, thus fostering the development of reinforcement learning for safer, more reliable, and responsible real-world applications. The website of this project can be accessed at //sites.google.com/view/safety-gymnasium.
By decoupling substrate resources, network virtualization (NV) is a promising solution for meeting diverse demands and ensuring differentiated quality of service (QoS). In particular, virtual network embedding (VNE) is a critical enabling technology that enhances the flexibility and scalability of network deployment by addressing the coupling of Internet processes and services. However, in the existing works, the black-box nature of deep neural networks (DNNs) limits the analysis, development, and improvement of systems. In recent times, interpretable deep learning (DL) represented by deep neuro-fuzzy systems (DNFS) combined with fuzzy inference has shown promising interpretability to further exploit the hidden value in the data. Motivated by this, we propose a DNFS-based VNE algorithm that aims to provide an interpretable NV scheme. Specifically, data-driven convolutional neural networks (CNNs) are used as fuzzy implication operators to compute the embedding probabilities of candidate substrate nodes through entailment operations. And, the identified fuzzy rule patterns are cached into the weights by forward computation and gradient back-propagation (BP). Moreover, the fuzzy rule base is constructed based on Mamdani-type linguistic rules using linguistic labels. In addition, the DNFS-driven five-block structure-based policy network serves as the agent for deep reinforcement learning (DRL), which optimizes VNE decision-making through interaction with the environment. Finally, the effectiveness of evaluation indicators and fuzzy rules is verified by experiments.
While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, the generation of spurious details that cannot be inferred from the given image. Dedicated methods for reducing hallucinations in image captioning largely focus on closed-vocabulary object tokens, ignoring most types of hallucinations that occur in practice. In this work, we propose MOCHa, an approach that harnesses advancements in reinforcement learning (RL) to address the sequence-level nature of hallucinations in an open-world setup. To optimize for caption fidelity to the input image, we leverage ground-truth reference captions as proxies to measure the logical consistency of generated captions. However, optimizing for caption fidelity alone fails to preserve the semantic adequacy of generations; therefore, we propose a multi-objective reward function that jointly targets these qualities, without requiring any strong supervision. We demonstrate that these goals can be simultaneously optimized with our framework, enhancing performance for various captioning models of different scales. Our qualitative and quantitative results demonstrate MOCHa's superior performance across various established metrics. We also demonstrate the benefit of our method in the open-vocabulary setting. To this end, we contribute OpenCHAIR, a new benchmark for quantifying open-vocabulary hallucinations in image captioning models, constructed using generative foundation models. We will release our code, benchmark, and trained models.
This work utilizes a variational autoencoder for channel estimation and evaluates it on real-world measurements. The estimator is trained solely on noisy channel observations and parameterizes an approximation to the mean squared error-optimal estimator by learning observation-dependent conditional first and second moments. The proposed estimator significantly outperforms related state-of-the-art estimators on real-world measurements. We investigate the effect of pre-training with synthetic data and find that the proposed estimator exhibits comparable results to the related estimators if trained on synthetic data and evaluated on the measurement data. Furthermore, pre-training on synthetic data also helps to reduce the required measurement training dataset size.
This work introduces DADAO: the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a given network of size $n$. Our key insight is based on modeling the local gradient updates and gossip communication procedures with separate independent Poisson Point Processes. This allows us to decouple the computation and communication steps, which can be run in parallel, while making the whole approach completely asynchronous. This leads to communication acceleration compared to synchronous approaches. Our new method employs primal gradients and does not use a multi-consensus inner loop nor other ad-hoc mechanisms such as Error Feedback, Gradient Tracking, or a Proximal operator. By relating the inverse of the smallest positive eigenvalue of the Laplacian matrix $\chi_1$ and the maximal resistance $\chi_2\leq \chi_1$ of the graph to a sufficient minimal communication rate between the nodes of the network, we show that our algorithm requires $\mathcal{O}(n\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ local gradients and only $\mathcal{O}(n\sqrt{\chi_1\chi_2}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ communications to reach a precision $\epsilon$, up to logarithmic terms. Thus, we simultaneously obtain an accelerated rate for both computations and communications, leading to an improvement over state-of-the-art works, our simulations further validating the strength of our relatively unconstrained method.
Recently, sparse 3D convolutions have changed 3D object detection. Performing on par with the voting-based approaches, 3D CNNs are memory-efficient and scale to large scenes better. However, there is still room for improvement. With a conscious, practice-oriented approach to problem-solving, we analyze the performance of such methods and localize the weaknesses. Applying modifications that resolve the found issues one by one, we end up with TR3D: a fast fully-convolutional 3D object detection model trained end-to-end, that achieves state-of-the-art results on the standard benchmarks, ScanNet v2, SUN RGB-D, and S3DIS. Moreover, to take advantage of both point cloud and RGB inputs, we introduce an early fusion of 2D and 3D features. We employ our fusion module to make conventional 3D object detection methods multimodal and demonstrate an impressive boost in performance. Our model with early feature fusion, which we refer to as TR3D+FF, outperforms existing 3D object detection approaches on the SUN RGB-D dataset. Overall, besides being accurate, both TR3D and TR3D+FF models are lightweight, memory-efficient, and fast, thereby marking another milestone on the way toward real-time 3D object detection. Code is available at //github.com/SamsungLabs/tr3d .
The recent success of Large Language Models (LLMs) signifies an impressive stride towards artificial general intelligence. They have shown a promising prospect in automatically completing tasks upon user instructions, functioning as brain-like coordinators. The associated risks will be revealed as we delegate an increasing number of tasks to machines for automated completion. A big question emerges: how can we make machines behave responsibly when helping humans automate tasks as personal copilots? In this paper, we explore this question in depth from the perspectives of feasibility, completeness and security. In specific, we present Responsible Task Automation (ResponsibleTA) as a fundamental framework to facilitate responsible collaboration between LLM-based coordinators and executors for task automation with three empowered capabilities: 1) predicting the feasibility of the commands for executors; 2) verifying the completeness of executors; 3) enhancing the security (e.g., the protection of users' privacy). We further propose and compare two paradigms for implementing the first two capabilities. One is to leverage the generic knowledge of LLMs themselves via prompt engineering while the other is to adopt domain-specific learnable models. Moreover, we introduce a local memory mechanism for achieving the third capability. We evaluate our proposed ResponsibleTA on UI task automation and hope it could bring more attentions to ensuring LLMs more responsible in diverse scenarios.
Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential. Quantization and sparsity are key algorithmic techniques that translate to repetition and sparsity within tensors at the hardware-software interface. This paper introduces the concept of repetition-sparsity trade-off that helps explain computational efficiency during inference. We propose Signed Binarization, a unified co-design framework that synergistically integrates hardware-software systems, quantization functions, and representation learning techniques to address this trade-off. Our results demonstrate that Signed Binarization is more accurate than binarization with the same number of non-zero weights. Detailed analysis indicates that signed binarization generates a smaller distribution of effectual (non-zero) parameters nested within a larger distribution of total parameters, both of the same type, for a DNN block. Finally, our approach achieves a 26% speedup on real hardware, doubles energy efficiency, and reduces density by 2.8x compared to binary methods for ResNet 18, presenting an alternative solution for deploying efficient models in resource-limited environments.
Artificial intelligence (AI) technologies have emerged as pivotal enablers across a multitude of industries, including consumer electronics, healthcare, and manufacturing, largely due to their resurgence over the past decade. The transformative power of AI is primarily derived from the utilization of deep neural networks (DNNs), which require extensive data for training and substantial computational resources for processing. Consequently, DNN models are typically trained and deployed on resource-rich cloud servers. However, due to potential latency issues associated with cloud communications, deep learning (DL) workflows are increasingly being transitioned to wireless edge networks near end-user devices (EUDs). This shift is designed to support latency-sensitive applications and has given rise to a new paradigm of edge AI, which will play a critical role in upcoming 6G networks to support ubiquitous AI applications. Despite its potential, edge AI faces substantial challenges, mostly due to the dichotomy between the resource limitations of wireless edge networks and the resource-intensive nature of DL. Specifically, the acquisition of large-scale data, as well as the training and inference processes of DNNs, can rapidly deplete the battery energy of EUDs. This necessitates an energy-conscious approach to edge AI to ensure both optimal and sustainable performance. In this paper, we present a contemporary survey on green edge AI. We commence by analyzing the principal energy consumption components of edge AI systems to identify the fundamental design principles of green edge AI. Guided by these principles, we then explore energy-efficient design methodologies for the three critical tasks in edge AI systems, including training data acquisition, edge training, and edge inference. Finally, we underscore potential future research directions to further enhance the energy efficiency of edge AI.
With the recent development of generative models, Text-to-3D generations have also seen significant growth. Nonetheless, achieving precise control over 3D generation continues to be an arduous task, as using text to control often leads to missing objects and imprecise locations. Contemporary strategies for enhancing controllability in 3D generation often entail the introduction of additional parameters, such as customized diffusion models. This often induces hardness in adapting to different diffusion models or creating distinct objects. In this paper, we present LucidDreaming as an effective pipeline capable of fine-grained control over 3D generation. It requires only minimal input of 3D bounding boxes, which can be deduced from a simple text prompt using a Large Language Model. Specifically, we propose clipped ray sampling to separately render and optimize objects with user specifications. We also introduce object-centric density blob bias, fostering the separation of generated objects. With individual rendering and optimizing of objects, our method excels not only in controlled content generation from scratch but also within the pre-trained NeRF scenes. In such scenarios, existing generative approaches often disrupt the integrity of the original scene, and current editing methods struggle to synthesize new content in empty spaces. We show that our method exhibits remarkable adaptability across a spectrum of mainstream Score Distillation Sampling-based 3D generation frameworks, and achieves superior alignment of 3D content when compared to baseline approaches. We also provide a dataset of prompts with 3D bounding boxes, benchmarking 3D spatial controllability.
Deep learning on graphs has attracted significant interests recently. However, most of the works have focused on (semi-) supervised learning, resulting in shortcomings including heavy label reliance, poor generalization, and weak robustness. To address these issues, self-supervised learning (SSL), which extracts informative knowledge through well-designed pretext tasks without relying on manual labels, has become a promising and trending learning paradigm for graph data. Different from SSL on other domains like computer vision and natural language processing, SSL on graphs has an exclusive background, design ideas, and taxonomies. Under the umbrella of graph self-supervised learning, we present a timely and comprehensive review of the existing approaches which employ SSL techniques for graph data. We construct a unified framework that mathematically formalizes the paradigm of graph SSL. According to the objectives of pretext tasks, we divide these approaches into four categories: generation-based, auxiliary property-based, contrast-based, and hybrid approaches. We further conclude the applications of graph SSL across various research fields and summarize the commonly used datasets, evaluation benchmark, performance comparison and open-source codes of graph SSL. Finally, we discuss the remaining challenges and potential future directions in this research field.