Contemporary real-time video communication systems, such as WebRTC, use an adaptive bitrate (ABR) algorithm to assure high-quality and low-delay services, e.g., promptly adjusting video bitrate according to the instantaneous network bandwidth. However, target bitrate decisions in the network and bitrate control in the codec are typically incoordinated and simply ignoring the effect of inappropriate resolution and frame rate settings also leads to compromised results in bitrate control, thus devastatingly deteriorating the quality of experience (QoE). To tackle these challenges, Mamba, an end-to-end multi-dimensional ABR algorithm is proposed, which utilizes multi-agent reinforcement learning (MARL) to maximize the user's QoE by adaptively and collaboratively adjusting encoding factors including the quantization parameters (QP), resolution, and frame rate based on observed states such as network conditions and video complexity information in a video conferencing system. We also introduce curriculum learning to improve the training efficiency of MARL. Both the in-lab and real-world evaluation results demonstrate the remarkable efficacy of Mamba.
Neural radiance fields (NeRFs) enable high-quality novel view synthesis, but their prohibitively high computational complexity limits deployability, especially on resource-constrained platforms. To enable practical usage of NeRFs, quality tuning is essential to reduce computational complexity, akin to adjustable graphics settings in video games. However while existing solutions strive for efficiency, they use one-size-fits-all architectures regardless of scene complexity, although the same architecture may be unnecessarily large for simple scenes but insufficient for complex ones. Thus as NeRFs become more widely used for 3D visualization, there is a need to dynamically optimize the neural network component of NeRFs to achieve a balance between computational complexity and specific targets for synthesis quality. Addressing this gap, we introduce NAS-NeRF: a generative neural architecture search strategy uniquely tailored to generate NeRF architectures on a per-scene basis by optimizing the trade-off between complexity and performance, while adhering to constraints on computational budget and minimum synthesis quality. Our experiments on the Blender synthetic dataset show the proposed NAS-NeRF can generate architectures up to 5.74$\times$ smaller, with 4.19$\times$ fewer FLOPs, and 1.93$\times$ faster on a GPU than baseline NeRFs, without suffering a drop in SSIM. Furthermore, we illustrate that NAS-NeRF can also achieve architectures up to 23$\times$ smaller, 22$\times$ fewer FLOPs, and 4.7$\times$ faster than baseline NeRFs with only a 5.3\% average SSIM drop. The source code for our work is also made publicly available at //saeejithnair.github.io/NAS-NeRF.
The integration of different modalities, such as audio and visual information, plays a crucial role in human perception of the surrounding environment. Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion architectures situated either at the top or bottom positions, rather than comprehensively considering multi-modal fusion at various hierarchical positions within the network. In this paper, we propose a novel model called self- and cross-attention network (SCANet), which leverages the attention mechanism for efficient audio-visual feature fusion. SCANet consists of two types of attention blocks: self-attention (SA) and cross-attention (CA) blocks, where the CA blocks are distributed at the top (TCA), middle (MCA) and bottom (BCA) of SCANet. These blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of SCANet, outperforming existing state-of-the-art (SOTA) methods while maintaining comparable inference time.
Monocular depth estimation (MDE) is a fundamental topic of geometric computer vision and a core technique for many downstream applications. Recently, several methods reframe the MDE as a classification-regression problem where a linear combination of probabilistic distribution and bin centers is used to predict depth. In this paper, we propose a novel concept of iterative elastic bins (IEBins) for the classification-regression-based MDE. The proposed IEBins aims to search for high-quality depth by progressively optimizing the search range, which involves multiple stages and each stage performs a finer-grained depth search in the target bin on top of its previous stage. To alleviate the possible error accumulation during the iterative process, we utilize a novel elastic target bin to replace the original target bin, the width of which is adjusted elastically based on the depth uncertainty. Furthermore, we develop a dedicated framework composed of a feature extractor and an iterative optimizer that has powerful temporal context modeling capabilities benefiting from the GRU-based architecture. Extensive experiments on the KITTI, NYU-Depth-v2 and SUN RGB-D datasets demonstrate that the proposed method surpasses prior state-of-the-art competitors. The source code is publicly available at //github.com/ShuweiShao/IEBins.
The problem of deep long-tailed learning, a prevalent challenge in the realm of generic visual recognition, persists in a multitude of real-world applications. To tackle the heavily-skewed dataset issue in long-tailed classification, prior efforts have sought to augment existing deep models with the elaborate class-balancing strategies, such as class rebalancing, data augmentation, and module improvement. Despite the encouraging performance, the limited class knowledge of the tailed classes in the training dataset still bottlenecks the performance of the existing deep models. In this paper, we propose an innovative long-tailed learning paradigm that breaks the bottleneck by guiding the learning of deep networks with external prior knowledge. This is specifically achieved by devising an elaborated ``prophetic'' teacher, termed as ``Propheter'', that aims to learn the potential class distributions. The target long-tailed prediction model is then optimized under the instruction of the well-trained ``Propheter'', such that the distributions of different classes are as distinguishable as possible from each other. Experiments on eight long-tailed benchmarks across three architectures demonstrate that the proposed prophetic paradigm acts as a promising solution to the challenge of limited class knowledge in long-tailed datasets. The developed code is publicly available at \url{//github.com/tcmyxc/propheter}.
In utilizing large language models (LLMs) for mathematical reasoning, addressing the errors in the reasoning and calculation present in the generated text by LLMs is a crucial challenge. In this paper, we propose a novel framework that integrates the Chain-of-Thought (CoT) method with an external tool (Python REPL). We discovered that by prompting LLMs to generate structured text in XML-like markup language, we could seamlessly integrate CoT and the external tool and control the undesired behaviors of LLMs. With our approach, LLMs can utilize Python computation to rectify errors within CoT. We applied our method to ChatGPT (GPT-3.5) to solve challenging mathematical problems and demonstrated that combining CoT and Python REPL through the markup language enhances the reasoning capability of LLMs. Our approach enables LLMs to write the markup language and perform advanced mathematical reasoning using only zero-shot prompting.
Denoising diffusion models represent a recent emerging topic in computer vision, demonstrating remarkable results in the area of generative modeling. A diffusion model is a deep generative model that is based on two stages, a forward diffusion stage and a reverse diffusion stage. In the forward diffusion stage, the input data is gradually perturbed over several steps by adding Gaussian noise. In the reverse stage, a model is tasked at recovering the original input data by learning to gradually reverse the diffusion process, step by step. Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens, i.e. low speeds due to the high number of steps involved during sampling. In this survey, we provide a comprehensive review of articles on denoising diffusion models applied in vision, comprising both theoretical and practical contributions in the field. First, we identify and present three generic diffusion modeling frameworks, which are based on denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations. We further discuss the relations between diffusion models and other deep generative models, including variational auto-encoders, generative adversarial networks, energy-based models, autoregressive models and normalizing flows. Then, we introduce a multi-perspective categorization of diffusion models applied in computer vision. Finally, we illustrate the current limitations of diffusion models and envision some interesting directions for future research.
Interpretability methods are developed to understand the working mechanisms of black-box models, which is crucial to their responsible deployment. Fulfilling this goal requires both that the explanations generated by these methods are correct and that people can easily and reliably understand them. While the former has been addressed in prior work, the latter is often overlooked, resulting in informal model understanding derived from a handful of local explanations. In this paper, we introduce explanation summary (ExSum), a mathematical framework for quantifying model understanding, and propose metrics for its quality assessment. On two domains, ExSum highlights various limitations in the current practice, helps develop accurate model understanding, and reveals easily overlooked properties of the model. We also connect understandability to other properties of explanations such as human alignment, robustness, and counterfactual minimality and plausibility.
Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we will comprehensively review the development of AICA in the recent two decades, especially focusing on the state-of-the-art methods with respect to three main challenges -- the affective gap, perception subjectivity, and label noise and absence. We begin with an introduction to the key emotion representation models that have been widely employed in AICA and description of available datasets for performing evaluation with quantitative comparison of label noise and dataset bias. We then summarize and compare the representative approaches on (1) emotion feature extraction, including both handcrafted and deep features, (2) learning methods on dominant emotion recognition, personalized emotion prediction, emotion distribution learning, and learning from noisy data or few labels, and (3) AICA based applications. Finally, we discuss some challenges and promising research directions in the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
Conventionally, spatiotemporal modeling network and its complexity are the two most concentrated research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of the complexity meanwhile efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two Height-Time and Width-Time planes, to capture the dynamics of video thoroughly. Secondly, our model is designed based on 2D CNN backbones and model complexity is well kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework and it can specialize to be existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance with 2D CNN's complexity.
Recurrent neural nets (RNN) and convolutional neural nets (CNN) are widely used on NLP tasks to capture the long-term and local dependencies, respectively. Attention mechanisms have recently attracted enormous interest due to their highly parallelizable computation, significantly less training time, and flexibility in modeling dependencies. We propose a novel attention mechanism in which the attention between elements from input sequence(s) is directional and multi-dimensional (i.e., feature-wise). A light-weight neural net, "Directional Self-Attention Network (DiSAN)", is then proposed to learn sentence embedding, based solely on the proposed attention without any RNN/CNN structure. DiSAN is only composed of a directional self-attention with temporal order encoded, followed by a multi-dimensional attention that compresses the sequence into a vector representation. Despite its simple form, DiSAN outperforms complicated RNN models on both prediction quality and time efficiency. It achieves the best test accuracy among all sentence encoding methods and improves the most recent best result by 1.02% on the Stanford Natural Language Inference (SNLI) dataset, and shows state-of-the-art test accuracy on the Stanford Sentiment Treebank (SST), Multi-Genre natural language inference (MultiNLI), Sentences Involving Compositional Knowledge (SICK), Customer Review, MPQA, TREC question-type classification and Subjectivity (SUBJ) datasets.