Multicasting is a vital information dissemination technique in Software-Defined Networking (SDN). With SDN, a multicast service can incorporate network functions implemented at different nodes, which is referred to as software-defined multicast. Emerging ubiquitous wireless networks for 5G and Beyond (B5G) inherently support multicast. However, the broadcast nature of wireless channels, especially in dense deployments, leads to neighborhood interference as a primary system degradation factor, which introduces a new challenge for software-defined multicast in wireless mesh networks. To tackle this, this paper introduces a novel approach, based on the idea of minimizing both the total length cost of the multicast tree and the interference at the same time. Accordingly, a bicriteria optimization problem is formulated, which is called \emph{Minimum Interference Steiner Tree (MIST)}. To solve the bicriteria problem, instead of resorting to heuristics, this paper employs an innovative approach that is an approximate algorithm for MIST but with guaranteed performance. Specifically, the approach is a two-stage relaxation algorithm by exploiting the monotone submodularity property of the interference metric and identifying Pareto optimal solutions for MIST. Simulation results demonstrate and validate the performance of the proposed algorithm.
Instruction finetuning on a variety of image-text instruction data is the key to obtaining a versatile Multimodal Large Language Model (MLLM), and different configurations of the instruction data can lead to finetuned models with different capabilities. However, we have discovered that data conflicts are inevitable when mixing instruction data from distinct domains, which can result in performance drops for tasks of a specific domain. To address this issue, we propose to apply a sparse mixture of LoRA experts for instruction finetuning MLLMs. Within the Transformer layers, we extend the popular Low-Rank Adaption (LoRA) method by creating a set of LoRA experts specifically for the MLP layer, and route each token to the top-1 expert based on a routing function, allowing adaptive choices for tokens from different domains. Since the LoRA experts are sparsely activated, the training and inference cost are kept roughly constant compared to the original LoRA method. By replacing the plain-LoRA finetuing of LLaVA-1.5, our final model is named LLaVA-MoLE. Extensive experiments proved that LLaVA-MoLE effectively mitigates the data conflict issue when mixing multiple distinct instruction datasets with various configurations, and achieves consistent performance gains over the strong plain-LoRA baselines. Most importantly, on the mixed datasets, LLaVA-MoLE can even outperform the plain-LoRA baseline trained with twice the samples.
Recently, foundational models such as CLIP and SAM have shown promising performance for the task of Zero-Shot Anomaly Segmentation (ZSAS). However, either CLIP-based or SAM-based ZSAS methods still suffer from non-negligible key drawbacks: 1) CLIP primarily focuses on global feature alignment across different inputs, leading to imprecise segmentation of local anomalous parts; 2) SAM tends to generate numerous redundant masks without proper prompt constraints, resulting in complex post-processing requirements. In this work, we innovatively propose a CLIP and SAM collaboration framework called ClipSAM for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation, which is further used as the prompt constraints for SAM to refine the anomaly segmentation results. In details, we introduce a crucial Unified Multi-scale Cross-modal Interaction (UMCI) module for interacting language with visual features at multiple scales of CLIP to reason anomaly positions. Then, we design a novel Multi-level Mask Refinement (MMR) module, which utilizes the positional information as multi-level prompts for SAM to acquire hierarchical levels of masks and merges them. Extensive experiments validate the effectiveness of our approach, achieving the optimal segmentation performance on the MVTec-AD and VisA datasets.
Multimodal Large Language Models (MLLMs) demonstrate impressive image understanding and generating capabilities. However, existing benchmarks employ limited charts that deviate from real-world scenarios, posing challenges in accurately assessing the chart comprehension of MLLMs. To overcome this constraint, we propose ChartBench, an exhaustive chart benchmark specifically designed to evaluate MLLMs' chart comprehension and data reliability through complex visual reasoning. ChartBench encompasses a wide spectrum, including 42 categories, 2.1K charts, and 16.8K question-answer pairs. Diverging from previous benchmarks, ChartBench avoids employing data point annotation charts or metadata prompts directly. Instead, it compels MLLMs to derive values akin to human understanding by leveraging inherent chart elements such as color, legends, or coordinate systems. Additionally, we propose an enhanced evaluation metric, Acc+, which facilitates the evaluation of MLLMs without needing labor-intensive manual efforts or costly evaluations based on GPT. Our extensive experimental evaluation involves 12 widely-used open-sourced and 2 proprietary MLLMs, revealing the limitations of MLLMs in interpreting charts and providing valuable insights to encourage closer scrutiny of this aspect.
Linguistic Steganography (LS) tasks aim to generate steganographic texts (stego) based on secret information. Only authorized recipients can perceive the existence of secret information in the texts and accurately extract it, thereby preserving privacy. However, the controllability of the stego generated by existing schemes is poor, and the generated stego is difficult to contain specific discourse characteristics such as style, genre, and theme. As a result, the stego are often easily detectable, compromising covert communication. To address these problems, this paper proposes a novel scheme named LLsM, a generative LS based on a Large Language Model (LLM). We fine-tuned the LLM LLaMA2 with a large-scale constructed dataset encompassing rich discourse characteristics, which enables the fine-tuned LLM to generate texts with specific discourse in a controllable manner. Then the discourse characteristics are used as guiding information and inputted into the fine-tuned LLM in the form of Prompt together with secret information. The candidate pool, derived from sampling and truncation, undergoes range encoding to ensure the stego imitate natural text distribution. Experiments demonstrate that LLsM performs superior to prevalent baselines regarding text quality, statistical analysis, discourse matching, and anti-steganalysis. In particular, LLsM's MAUVE surpasses that of some baselines by 70%-80%, and its anti-steganalysis performance is 30%-40% higher. Notably, we also present the long stego generated by LLsM, showing its potential superiority in long LS tasks.
Recently, Graph Neural Network (GNN)-based vulnerability detection systems have achieved remarkable success. However, the lack of explainability poses a critical challenge to deploy black-box models in security-related domains. For this reason, several approaches have been proposed to explain the decision logic of the detection model by providing a set of crucial statements positively contributing to its predictions. Unfortunately, due to the weakly-robust detection models and suboptimal explanation strategy, they have the danger of revealing spurious correlations and redundancy issue. In this paper, we propose Coca, a general framework aiming to 1) enhance the robustness of existing GNN-based vulnerability detection models to avoid spurious explanations; and 2) provide both concise and effective explanations to reason about the detected vulnerabilities. \sysname consists of two core parts referred to as Trainer and Explainer. The former aims to train a detection model which is robust to random perturbation based on combinatorial contrastive learning, while the latter builds an explainer to derive crucial code statements that are most decisive to the detected vulnerability via dual-view causal inference as explanations. We apply Coca over three typical GNN-based vulnerability detectors. Experimental results show that Coca can effectively mitigate the spurious correlation issue, and provide more useful high-quality explanations.
This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses state-of-the-art systems by 10 - 200X in latency performance when running various LLM inference workloads.
Linear solvers are major computational bottlenecks in a wide range of decision support and optimization computations. The challenges become even more pronounced on heterogeneous hardware, where traditional sparse numerical linear algebra methods are often inefficient. For example, methods for solving ill-conditioned linear systems have relied on conditional branching, which degrades performance on hardware accelerators such as graphical processing units (GPUs). To improve the efficiency of solving ill-conditioned systems, our computational strategy separates computations that are efficient on GPUs from those that need to run on traditional central processing units (CPUs). Our strategy maximizes the reuse of expensive CPU computations. Iterative methods, which thus far have not been broadly used for ill-conditioned linear systems, play an important role in our approach. In particular, we extend ideas from [1] to implement iterative refinement using inexact LU factors and flexible generalized minimal residual (FGMRES), with the aim of efficient performance on GPUs. We focus on solutions that are effective within broader application contexts, and discuss how early performance tests could be improved to be more predictive of the performance in a realistic environment
Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, how to properly describe both image global structures and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing image output. During the above multi-scale representation learning stage, additional input conditions like text, scene graph, or image layout can be further exploited. Thus, Frido can be also applied for conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditioned and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, scene-graph-to-image, to label-to-image. More specifically, we achieved state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at //github.com/davidhalladay/Frido.
With the extremely rapid advances in remote sensing (RS) technology, a great quantity of Earth observation (EO) data featuring considerable and complicated heterogeneity is readily available nowadays, which renders researchers an opportunity to tackle current geoscience applications in a fresh way. With the joint utilization of EO data, much research on multimodal RS data fusion has made tremendous progress in recent years, yet these developed traditional algorithms inevitably meet the performance bottleneck due to the lack of the ability to comprehensively analyse and interpret these strongly heterogeneous data. Hence, this non-negligible limitation further arouses an intense demand for an alternative tool with powerful processing competence. Deep learning (DL), as a cutting-edge technology, has witnessed remarkable breakthroughs in numerous computer vision tasks owing to its impressive ability in data representation and reconstruction. Naturally, it has been successfully applied to the field of multimodal RS data fusion, yielding great improvement compared with traditional methods. This survey aims to present a systematic overview in DL-based multimodal RS data fusion. More specifically, some essential knowledge about this topic is first given. Subsequently, a literature survey is conducted to analyse the trends of this field. Some prevalent sub-fields in the multimodal RS data fusion are then reviewed in terms of the to-be-fused data modalities, i.e., spatiospectral, spatiotemporal, light detection and ranging-optical, synthetic aperture radar-optical, and RS-Geospatial Big Data fusion. Furthermore, We collect and summarize some valuable resources for the sake of the development in multimodal RS data fusion. Finally, the remaining challenges and potential future directions are highlighted.
Deep neural networks (DNNs) are successful in many computer vision tasks. However, the most accurate DNNs require millions of parameters and operations, making them energy, computation and memory intensive. This impedes the deployment of large DNNs in low-power devices with limited compute resources. Recent research improves DNN models by reducing the memory requirement, energy consumption, and number of operations without significantly decreasing the accuracy. This paper surveys the progress of low-power deep learning and computer vision, specifically in regards to inference, and discusses the methods for compacting and accelerating DNN models. The techniques can be divided into four major categories: (1) parameter quantization and pruning, (2) compressed convolutional filters and matrix factorization, (3) network architecture search, and (4) knowledge distillation. We analyze the accuracy, advantages, disadvantages, and potential solutions to the problems with the techniques in each category. We also discuss new evaluation metrics as a guideline for future research.