Deep learning-based image signal processor (ISP) models for mobile cameras can generate high-quality images that rival those of professional DSLR cameras. However, their computational demands often make them unsuitable for mobile settings. Additionally, modern mobile cameras employ non-Bayer color filter arrays (CFA) such as Quad Bayer, Nona Bayer, and QxQ Bayer to enhance image quality, yet most existing deep learning-based ISP (or demosaicing) models focus primarily on standard Bayer CFAs. In this study, we present PyNET-QxQ, a lightweight demosaicing model specifically designed for QxQ Bayer CFA patterns, which is derived from the original PyNET. We also propose a knowledge distillation method called progressive distillation to train the reduced network more effectively. Consequently, PyNET-QxQ contains less than 2.5% of the parameters of the original PyNET while preserving its performance. Experiments using QxQ images captured by a proto type QxQ camera sensor show that PyNET-QxQ outperforms existing conventional algorithms in terms of texture and edge reconstruction, despite its significantly reduced parameter count.
The accurate modeling of dynamics in interactive environments is critical for successful long-range prediction. Such a capability could advance Reinforcement Learning (RL) and Planning algorithms, but achieving it is challenging. Inaccuracies in model estimates can compound, resulting in increased errors over long horizons. We approach this problem from the lens of Koopman theory, where the nonlinear dynamics of the environment can be linearized in a high-dimensional latent space. This allows us to efficiently parallelize the sequential problem of long-range prediction using convolution, while accounting for the agent's action at every time step. Our approach also enables stability analysis and better control over gradients through time. Taken together, these advantages result in significant improvement over the existing approaches, both in the efficiency and the accuracy of modeling dynamics over extended horizons. We also report promising experimental results in dynamics modeling for the scenarios of both model-based planning and model-free RL.
Given an arbitrary audio clip, audio-driven 3D facial animation aims to generate lifelike lip motions and facial expressions for a 3D head. Existing methods typically rely on training their models using limited public 3D datasets that contain a restricted number of audio-3D scan pairs. Consequently, their generalization capability remains limited. In this paper, we propose a novel method that leverages in-the-wild 2D talking-head videos to train our 3D facial animation model. The abundance of easily accessible 2D talking-head videos equips our model with a robust generalization capability. By combining these videos with existing 3D face reconstruction methods, our model excels in generating consistent and high-fidelity lip synchronization. Additionally, our model proficiently captures the speaking styles of different individuals, allowing it to generate 3D talking-heads with distinct personal styles. Extensive qualitative and quantitative experimental results demonstrate the superiority of our method.
Eye blinking detection in the wild plays an essential role in deception detection, driving fatigue detection, etc. Despite the fact that numerous attempts have already been made, the majority of them have encountered difficulties, such as the derived eye images having different resolutions as the distance between the face and the camera changes; or the requirement of a lightweight detection model to obtain a short inference time in order to perform in real-time. In this research, two problems are addressed: how the eye blinking detection model can learn efficiently from different resolutions of eye pictures in diverse conditions; and how to reduce the size of the detection model for faster inference time. We propose to utilize upsampling and downsampling the input eye images to the same resolution as one potential solution for the first problem, then find out which interpolation method can result in the highest performance of the detection model. For the second problem, although a recent spatiotemporal convolutional neural network used for eye blinking detection has a strong capacity to extract both spatial and temporal characteristics, it remains having a high number of network parameters, leading to high inference time. Therefore, using Depth-wise Separable Convolution rather than conventional convolution layers inside each branch is considered in this paper as a feasible solution.
Linear State Space Models (SSMs) have demonstrated strong performance in a variety of sequence modeling tasks due to their efficient encoding of the recurrent structure. However, in more comprehensive tasks like language modeling and machine translation, self-attention-based models still outperform SSMs. Hybrid models employing both SSM and self-attention generally show promising performance, but current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. In this work, we introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely and dynamically activate sub-modules for sequence elements in a differentiable manner. Through allowing each element to skip non-activated sub-modules, SMA reduces computation and memory consumption at both training and inference stages of sequence modeling. As a specific instantiation of SMA, we design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned from an SSM. By constraining the GAU to only conduct local attention on the activated inputs, SeqBoat can achieve linear inference complexity with theoretically infinite attention span, and provide substantially better quality-efficiency trade-off than the chunking-based models. With experiments on a wide range of tasks, including language modeling, speech classification and long-range arena, SeqBoat brings new state-of-the-art results among hybrid models with linear complexity and reveals the amount of attention needed for each task through the learned sparse activation patterns.
This paper describes an efficient unsupervised learning method for a neural source separation model that utilizes a probabilistic generative model of observed multichannel mixtures proposed for blind source separation (BSS). For this purpose, amortized variational inference (AVI) has been used for directly solving the inverse problem of BSS with full-rank spatial covariance analysis (FCA). Although this unsupervised technique called neural FCA is in principle free from the domain mismatch problem, it is computationally demanding due to the full rankness of the spatial model in exchange for robustness against relatively short reverberations. To reduce the model complexity without sacrificing performance, we propose neural FastFCA based on the jointly-diagonalizable yet full-rank spatial model. Our neural separation model introduced for AVI alternately performs neural network blocks and single steps of an efficient iterative algorithm called iterative source steering. This alternating architecture enables the separation model to quickly separate the mixture spectrogram by leveraging both the deep neural network and the multichannel optimization algorithm. The training objective with AVI is derived to maximize the marginalized likelihood of the observed mixtures. The experiment using mixture signals of two to four sound sources shows that neural FastFCA outperforms conventional BSS methods and reduces the computational time to about 2% of that for the neural FCA.
Objective: This paper investigates how generative models, trained on ground-truth images, can be used \changes{as} priors for inverse problems, penalizing reconstructions far from images the generator can produce. The aim is that learned regularization will provide complex data-driven priors to inverse problems while still retaining the control and insight of a variational regularization method. Moreover, unsupervised learning, without paired training data, allows the learned regularizer to remain flexible to changes in the forward problem such as noise level, sampling pattern or coil sensitivities in MRI. Approach: We utilize variational autoencoders (VAEs) that generate not only an image but also a covariance uncertainty matrix for each image. The covariance can model changing uncertainty dependencies caused by structure in the image, such as edges or objects, and provides a new distance metric from the manifold of learned images. Main results: We evaluate these novel generative regularizers on retrospectively sub-sampled real-valued MRI measurements from the fastMRI dataset. We compare our proposed learned regularization against other unlearned regularization approaches and unsupervised and supervised deep learning methods. Significance: Our results show that the proposed method is competitive with other state-of-the-art methods and behaves consistently with changing sampling patterns and noise levels.
UAV (unmanned aerial vehicle) is rapidly gaining traction in various human activities and has become an integral component of the satellite-air-ground-sea (SAGS) integrated network. As high-speed moving objects, UAVs not only have extremely strict requirements for communication delay, but also cannot be maliciously controlled as a weapon by the attacker. Therefore, an efficient and secure communication method designed for UAV networks is necessary. We propose a communication mechanism ESCM. For high efficiency, ESCM provides a routing protocol based on the artificial bee colony (ABC) algorithm to accelerate communications between UAVs. Meanwhile, we use blockchain to guarantee the security of UAV networks. However, blockchain has unstable links in high-mobility networks resulting in low consensus efficiency and high communication overhead. Consequently, ESCM introduces digital twin (DT), which transforms the UAV network into a static network by mapping UAVs from the physical world into Cyberspace. This virtual UAV network is called CyberUAV. Then, in CyberUAV, we design a blockchain consensus based on network coding, named Proof of Network Coding (PoNC). Analysis and simulation show that the above modules in ESCM have advantages over existing schemes. Through ablation studies, we demonstrate that these modules are indispensable for efficient and secure communication of UAV networks.
Multivariate sequential data collected in practice often exhibit temporal irregularities, including nonuniform time intervals and component misalignment. However, if uneven spacing and asynchrony are endogenous characteristics of the data rather than a result of insufficient observation, the information content of these irregularities plays a defining role in characterizing the multivariate dependence structure. Existing approaches for probabilistic forecasting either overlook the resulting statistical heterogeneities, are susceptible to imputation biases, or impose parametric assumptions on the data distribution. This paper proposes an end-to-end solution that overcomes these limitations by allowing the observation arrival times to play the central role of model construction, which is at the core of temporal irregularities. To acknowledge temporal irregularities, we first enable unique hidden states for components so that the arrival times can dictate when, how, and which hidden states to update. We then develop a conditional flow representation to non-parametrically represent the data distribution, which is typically non-Gaussian, and supervise this representation by carefully factorizing the log-likelihood objective to select conditional information that facilitates capturing time variation and path dependency. The broad applicability and superiority of the proposed solution are confirmed by comparing it with existing approaches through ablation studies and testing on real-world datasets.
Conventionally, spatiotemporal modeling network and its complexity are the two most concentrated research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of the complexity meanwhile efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two Height-Time and Width-Time planes, to capture the dynamics of video thoroughly. Secondly, our model is designed based on 2D CNN backbones and model complexity is well kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework and it can specialize to be existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance with 2D CNN's complexity.
We introduce a generic framework that reduces the computational cost of object detection while retaining accuracy for scenarios where objects with varied sizes appear in high resolution images. Detection progresses in a coarse-to-fine manner, first on a down-sampled version of the image and then on a sequence of higher resolution regions identified as likely to improve the detection accuracy. Built upon reinforcement learning, our approach consists of a model (R-net) that uses coarse detection results to predict the potential accuracy gain for analyzing a region at a higher resolution and another model (Q-net) that sequentially selects regions to zoom in. Experiments on the Caltech Pedestrians dataset show that our approach reduces the number of processed pixels by over 50% without a drop in detection accuracy. The merits of our approach become more significant on a high resolution test set collected from YFCC100M dataset, where our approach maintains high detection performance while reducing the number of processed pixels by about 70% and the detection time by over 50%.