As Vision Transformers (ViTs) increasingly set new benchmarks in computer vision, their practical deployment on inference engines is often hindered by their significant memory bandwidth and (on-chip) memory footprint requirements. This paper addresses this memory limitation by introducing an activation-aware model compression methodology that uses selective low-rank weight tensor approximations of different layers to reduce the parameter count of ViTs. The key idea is to decompose the weight tensors into a sum of two parameter-efficient tensors while minimizing the error between the product of the input activations with the original weight tensor and the product of the input activations with the approximate tensor sum. This approximation is further refined by adopting an efficient layer-wise error compensation technique that uses the gradient of the layer's output loss. The combination of these techniques achieves excellent results while it avoids being trapped in a shallow local minimum early in the optimization process and strikes a good balance between the model compression and output accuracy. Notably, the presented method significantly reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset, overcoming the usual accuracy degradation seen in low-rank approximations. In addition to this, the presented compression technique can compress large DeiT/ViT models to have about the same model size as smaller DeiT/ViT variants while yielding up to 1.8% accuracy gain. These results highlight the efficacy of our approach, presenting a viable solution for embedding ViTs in memory-constrained environments without compromising their performance.
The All-Pairs Shortest Paths (APSP) problem is one of the fundamental problems in theoretical computer science. It asks to compute the distance matrix of a given $n$-vertex graph. We revisit the classical problem of maintaining the distance matrix under a fully dynamic setting undergoing vertex insertions and deletions with a fast worst-case running time and efficient space usage. Although an algorithm with amortized update-time $\tilde O(n ^ 2)$ has been known for nearly two decades [Demetrescu and Italiano, STOC 2003], the current best algorithm for worst-case running time with efficient space usage runs is due to [Gutenberg and Wulff-Nilsen, SODA 2020], which improves the space usage of the previous algorithm due to [Abraham, Chechik, and Krinninger, SODA 2017] to $\tilde O(n ^ 2)$ but fails to improve their running time of $\tilde O(n ^ {2 + 2 / 3})$. It has been conjectured that no algorithm in $O(n ^ {2.5 - \epsilon})$ worst-case update time exists. For graphs without negative cycles, we meet this conjectured lower bound by introducing a Monte Carlo algorithm running in randomized $\tilde O(n ^ {2.5})$ time while keeping the $\tilde O(n ^ 2)$ space bound from the previous algorithm. Our breakthrough is made possible by the idea of ``hop-dominant shortest paths,'' which are shortest paths with a constraint on hops (number of vertices) that remain shortest after we relax the constraint by a constant factor.
Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaption (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at //github.com/X-PLUG/mPLUG-Owl. The online demo is available at //www.modelscope.cn/studios/damo/mPLUG-Owl.
This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting by using specialized agents as tools. Unlike existing approaches, our study focuses on the system's performance without fine-tuning it on specific VQA datasets, making it more practical and robust in the open world. We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, offering new directions for future research.
Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.
Mobile edge computing (MEC) has been regarded as a promising technique to support latencysensitivity and computation-intensive serves. However, the low offloading rate caused by the random channel fading characteristic becomes a major bottleneck in restricting the performance of the MEC. Fortunately, reconfigurable intelligent surface (RIS) can alleviate this problem since it can boost both the spectrum- and energy- efficiency. Different from the existing works adopting either fully active or fully passive RIS, we propose a novel hybrid RIS in which reflecting units can flexibly switch between active and passive modes. To achieve a tradeoff between the latency and energy consumption, an optimization problem is formulated by minimizing the total cost. In light of the intractability of the problem, we develop an alternating optimization-based iterative algorithm by combining the successive convex approximation method, the variable substitution, and the singular value decomposition (SVD) to obtain sub-optimal solutions. Furthermore, in order to gain more insight into the problem, we consider two special cases involving a latency minimization problem and an energy consumption minimization problem, and respectively analyze the tradeoff between the number of active and passive units. Simulation results verify that the proposed algorithm can achieve flexible mode switching and significantly outperforms existing algorithms.
Unsupervised Domain Adaptation (UDA) can effectively address domain gap issues in real-world image Super-Resolution (SR) by accessing both the source and target data. Considering privacy policies or transmission restrictions of source data in practical scenarios, we propose a SOurce-free Domain Adaptation framework for image SR (SODA-SR) to address this issue, i.e., adapt a source-trained model to a target domain with only unlabeled target data. SODA-SR leverages the source-trained model to generate refined pseudo-labels for teacher-student learning. To better utilize pseudo-labels, we propose a novel wavelet-based augmentation method, named Wavelet Augmentation Transformer (WAT), which can be flexibly incorporated with existing networks, to implicitly produce useful augmented data. WAT learns low-frequency information of varying levels across diverse samples, which is aggregated efficiently via deformable attention. Furthermore, an uncertainty-aware self-training mechanism is proposed to improve the accuracy of pseudo-labels, with inaccurate predictions being rectified by uncertainty estimation. To acquire better SR results and avoid overfitting pseudo-labels, several regularization losses are proposed to constrain target LR and SR images in the frequency domain. Experiments show that without accessing source data, SODA-SR outperforms state-of-the-art UDA methods in both synthetic$\rightarrow$real and real$\rightarrow$real adaptation settings, and is not constrained by specific network architectures.
Network calculus (NC), particularly its min-plus branch, has been extensively utilized to construct service models and compute delay bounds for time-sensitive networks (TSNs). This paper provides a revisit to the fundamental results. In particular, counterexamples to the most basic min-plus service models, which have been proposed for TSNs and used for computing delay bounds, indicate that the packetization effect has often been overlooked. To address, the max-plus branch of NC is also considered in this paper, whose models handle packetized traffic more explicitly. It is found that mapping the min-plus models to the max-plus models may bring in an immediate improvement over delay bounds derived from the min-plus analysis. In addition, an integrated analytical approach that combines models from both the min-plus and the max-plus NC branches is introduced. In this approach, the max-plus $g$-server model is extended and the extended model, called $g^{x}$-server, is used together with the min-plus arrival curve traffic model. By applying the integrated NC approach, service and delay bounds are derived for several settings that are fundamental in TSNs.
Graph Neural Networks (GNNs) have shown promising performance in various graph learning tasks, but at the cost of resource-intensive computations. The primary overhead of GNN update stems from graph propagation and weight transformation, both involving operations on graph-scale matrices. Previous studies attempt to reduce the computational budget by leveraging graph-level or network-level sparsification techniques, resulting in downsized graph or weights. In this work, we propose Unifews, which unifies the two operations in an entry-wise manner considering individual matrix elements, and conducts joint edge-weight sparsification to enhance learning efficiency. The entry-wise design of Unifews enables adaptive compression across GNN layers with progressively increased sparsity, and is applicable to a variety of architectural designs with on-the-fly operation simplification. Theoretically, we establish a novel framework to characterize sparsified GNN learning in view of a graph optimization process, and prove that Unifews effectively approximates the learning objective with bounded error and reduced computational load. We conduct extensive experiments to evaluate the performance of our method in diverse settings. Unifews is advantageous in jointly removing more than 90% of edges and weight entries with comparable or better accuracy than baseline models. The sparsification offers remarkable efficiency improvements including 10-20x matrix operation reduction and up to 100x acceleration in graph propagation time for the largest graph at the billion-edge scale.
Recent advancements in artificial intelligence (AI), especially large language models (LLMs), have significantly advanced healthcare applications and demonstrated potentials in intelligent medical treatment. However, there are conspicuous challenges such as vast data volumes and inconsistent symptom characterization standards, preventing full integration of healthcare AI systems with individual patients' needs. To promote professional and personalized healthcare, we propose an innovative framework, Heath-LLM, which combines large-scale feature extraction and medical knowledge trade-off scoring. Compared to traditional health management applications, our system has three main advantages: (1) It integrates health reports and medical knowledge into a large model to ask relevant questions to large language model for disease prediction; (2) It leverages a retrieval augmented generation (RAG) mechanism to enhance feature extraction; (3) It incorporates a semi-automated feature updating framework that can merge and delete features to improve accuracy of disease prediction. We experiment on a large number of health reports to assess the effectiveness of Health-LLM system. The results indicate that the proposed system surpasses the existing ones and has the potential to significantly advance disease prediction and personalized health management. The code is available at //github.com/jmyissb/HealthLLM.
We propose a knowledge-enhanced approach, ERNIE-ViL, to learn joint representations of vision and language. ERNIE-ViL tries to construct the detailed semantic connections (objects, attributes of objects and relationships between objects in visual scenes) across vision and language, which are essential to vision-language cross-modal tasks. Incorporating knowledge from scene graphs, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction and Relationship Prediction in the pre-training phase. More specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can model the joint representation characterizing the alignments of the detailed semantics across vision and language. Pre-trained on two large image-text alignment datasets (Conceptual Captions and SBU), ERNIE-ViL learns better and more robust joint representations. It achieves state-of-the-art performance on 5 vision-language downstream tasks after fine-tuning ERNIE-ViL. Furthermore, it ranked the 1st place on the VCR leader-board with an absolute improvement of 3.7%.