We investigated the imaging performance of a fast convergent ordered-subsets algorithm with subiteration-dependent preconditioners (SDPs) for positron emission tomography (PET) image reconstruction. In particular, we considered the use of SDP with the block sequential regularized expectation maximization (BSREM) approach with the relative difference prior (RDP) regularizer, owing to its prior clinical adoption by vendors. Because RDP regularization promotes smoothness in the reconstructed image, the directions of the gradients in smooth areas point more accurately toward the objective function's minimizer than those in variable areas. Motivated by this observation, we designed two SDPs that increase the iteration step size in smooth areas and reduce it in variable areas, relative to a conventional expectation maximization preconditioner. The momentum technique used for convergence acceleration can be viewed as a special case of SDP. We proved the global convergence of SDP-BSREM algorithms under certain assumptions on the preconditioner. Through numerical experiments using both simulated and clinical PET data, we showed that SDP-BSREM substantially improves the convergence rate compared with conventional BSREM and a vendor's implementation, Q.Clear. Specifically, in experiments with a synthetic brain phantom, both proposed algorithms converged approximately twice as fast as conventional BSREM, while in experiments with clinical whole-body patient PET data (with and without time-of-flight information), the SDP-BSREM algorithm converged 35\%-50\% faster than the commercial Q.Clear algorithm.
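To make the step-size modulation concrete, the following minimal sketch (our own illustration, not the paper's exact algorithm; the function names and the local-variability heuristic are assumptions) shows how an EM-style preconditioner could be rescaled per voxel so that smooth regions take larger steps:

```python
import numpy as np

def smoothness_weights(x, eps=1e-6):
    """Toy local-variability heuristic (an assumption, not the paper's
    exact design): weights above 1 in smooth regions, below 1 in highly
    variable ones. x is a 2-D image estimate."""
    local_var = np.mean(np.abs(np.gradient(x)), axis=0)
    return 2.0 / (1.0 + local_var / (np.mean(local_var) + eps))

def sdp_bsrem_step(x, subset_grad, step_size):
    """One hypothetical SDP-BSREM subiteration: the conventional EM
    preconditioner (proportional to the current estimate x) is rescaled
    per voxel before the gradient step; nonnegativity is enforced."""
    precond = (x + 1e-8) * smoothness_weights(x)  # subiteration-dependent
    return np.maximum(x - step_size * precond * subset_grad, 0.0)
```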
Video analytics pipelines have steadily shifted to edge deployments to reduce bandwidth overheads and privacy violations, but in doing so, face an ever-growing resource tension. Most notably, edge-box GPUs lack the memory needed to concurrently house the growing number of (increasingly complex) models for real-time inference. Unfortunately, existing solutions that rely on time/space sharing of GPU resources are insufficient as the required swapping delays result in unacceptable frame drops and accuracy violations. We present model merging, a new memory management technique that exploits architectural similarities between edge vision models by judiciously sharing their layers (including weights) to reduce workload memory costs and swapping delays. Our system, GEMEL, efficiently integrates merging into existing pipelines by (1) leveraging several guiding observations about per-model memory usage and inter-layer dependencies to quickly identify fruitful and accuracy-preserving merging configurations, and (2) altering edge inference schedules to maximize merging benefits. Experiments across diverse workloads reveal that GEMEL reduces memory usage by up to 60.7%, and improves overall accuracy by 8-39% relative to time/space sharing alone.
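As a rough illustration of the merging idea (our own sketch; GEMEL's actual configuration search and accuracy-preserving retraining are more involved), two architecturally identical prefixes of PyTorch models could be made to share one set of layer objects, so their weights occupy GPU memory once:

```python
import torch.nn as nn

def merge_shared_prefix(model_a: nn.Sequential, model_b: nn.Sequential):
    """Toy stand-in for a merging search: share the longest prefix of two
    nn.Sequential models whose layer types and parameter shapes match,
    so those layers (weights included) are stored only once."""
    n = 0
    for la, lb in zip(model_a, model_b):
        if type(la) is not type(lb):
            break
        if [p.shape for p in la.parameters()] != [p.shape for p in lb.parameters()]:
            break
        model_b[n] = la  # alias the module: one copy now serves both models
        n += 1
    return n  # number of merged layers
```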
Image reconstruction from indirect, noisy, or incomplete data remains an important yet challenging task. While methods such as compressive sensing have demonstrated high-resolution image recovery in various settings, robustness issues remain because of parameter tuning. Moreover, since the recovery is limited to a point estimate, uncertainty quantification, which is often desirable, is impossible. Due to these inherent limitations, a sparse Bayesian learning approach is sometimes adopted to recover a posterior distribution of the unknown. Sparse Bayesian learning assumes that some linear transformation of the unknown is sparse. However, most of the methods developed are tailored to specific problems, with particular forward models and priors. Here, we present a generalized approach to sparse Bayesian learning that can accommodate various types of data acquisition and prior information. Preliminary results on image reconstruction/recovery indicate its potential use for denoising, deblurring, and magnetic resonance imaging.
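For reference, the classical special case with an identity sparsifying transform (the paper generalizes this to other transforms and forward models) admits the textbook evidence-maximization (EM) updates, sketched below; `Phi` is the forward operator and `sigma2` the assumed noise variance:

```python
import numpy as np

def sbl_em(Phi, y, sigma2=1e-2, n_iter=50):
    """Classical sparse Bayesian learning via EM (Tipping-style updates).
    Returns the posterior mean and per-coefficient precisions alpha;
    large alpha_i prunes coefficient i toward zero."""
    n, m = Phi.shape
    alpha = np.ones(m)                            # hyperparameters (precisions)
    for _ in range(n_iter):
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
        mu = Sigma @ Phi.T @ y / sigma2           # posterior mean
        alpha = 1.0 / (mu**2 + np.diag(Sigma))    # EM hyperparameter update
    return mu, alpha
```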
Recent studies have significantly enhanced the performance of single-image super-resolution (SR) using convolutional neural networks (CNNs). While there can be many high-resolution (HR) solutions for a given input, most existing CNN-based methods do not explore alternative solutions during inference. A typical approach to obtaining alternative SR results is to train multiple SR models with different loss weightings and exploit combinations of these models. Instead of using multiple models, we present a more efficient method that trains a single adjustable SR model on various combinations of losses by taking advantage of multi-task learning. Specifically, we optimize an SR model with a conditional objective during training, where the objective is a weighted sum of multiple perceptual losses at different feature levels. The weights vary according to given conditions, and the set of weights is defined as a style controller. We also present an architecture suited to this training scheme: a Residual-in-Residual Dense Block equipped with spatial feature transformation layers. At inference, our trained model can generate locally different outputs conditioned on the style control map. Extensive experiments show that the proposed SR model produces various desirable reconstructions without artifacts and yields quantitative performance comparable to state-of-the-art SR methods.
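A minimal sketch of such a conditional objective (our own illustration; the feature extractors and the sampling of the style-controller weights are assumptions) is:

```python
import torch

def conditional_perceptual_loss(feats_sr, feats_hr, weights):
    """Hypothetical conditional objective: a weighted sum of perceptual
    (feature-space) losses at several feature levels. `weights` is the
    sampled style-controller vector that also conditions the SR model,
    so one network learns the whole family of loss trade-offs."""
    loss = 0.0
    for w, f_sr, f_hr in zip(weights, feats_sr, feats_hr):
        loss = loss + w * torch.mean((f_sr - f_hr) ** 2)
    return loss
```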
Internet-of-vehicles (IoV) is a general concept referring to, e.g., autonomous-driving-based vehicle-to-everything (V2X) communications or moving relays. Here, high rate and reliability demands call for advanced multi-antenna techniques and millimeter-wave (mmw) communications. However, the sensitivity of mmw signals to blockage may limit system performance, especially in highway/rural areas with few building reflectors, sparse base station deployments, and high-speed devices. To avoid blockage, various techniques have been proposed, among which the reconfigurable intelligent surface (RIS) is a candidate. RIS, however, has mainly been of interest in stationary/low-mobility scenarios, due to the associated channel state information acquisition and beam management overhead as well as imperfect reflection. In this article, we study the potential and challenges of RIS-assisted dynamic blockage avoidance in IoV networks. In particular, by designing region-based RIS pre-selection and blockage prediction schemes, we show that RIS-assisted communication can boost the performance of IoV networks. However, several issues must still be solved before RIS can be practically deployed in IoV networks.
We address general-shaped clustering problems under very weak parametric assumptions with a two-step hybrid robust clustering algorithm based on trimmed k-means and hierarchical agglomeration. The algorithm has low computational complexity and effectively identifies clusters even in the presence of data contamination. We also present natural generalizations of the approach, as well as an adaptive procedure that estimates the amount of contamination in a data-driven fashion. Our proposal outperforms state-of-the-art robust, model-based methods in numerical simulations and in real-world applications related to color quantization for image analysis, human mobility patterns based on GPS data, biomedical images of diabetic retinopathy, and functional data across weather stations.
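The two-step structure can be sketched as follows (a simplified variant using scikit-learn; the paper's trimmed k-means trims within the iterations rather than after convergence, and the linkage and parameter choices here are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def hybrid_robust_clustering(X, n_components=50, n_clusters=3, trim=0.05):
    """Sketch of the two-step idea: (1) fit many k-means components and
    trim the farthest points as contamination; (2) agglomerate the
    surviving centroids so merged components trace general shapes."""
    km = KMeans(n_clusters=n_components, n_init=10).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = d <= np.quantile(d, 1.0 - trim)            # step 1: trimming
    merge = AgglomerativeClustering(n_clusters=n_clusters, linkage="single")
    centroid_labels = merge.fit_predict(km.cluster_centers_)  # step 2
    return np.where(keep, centroid_labels[km.labels_], -1)   # -1 = trimmed
```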
Massive multiple-input multiple-output (MIMO) is promising for low earth orbit (LEO) satellite communications due to its potential to enhance spectral efficiency. However, conventional fully digital precoding architectures can lead to high implementation complexity and energy consumption. In this paper, hybrid analog/digital precoding solutions are developed for the downlink of LEO massive MIMO satellite communications by exploiting the slowly varying statistical channel state information (CSI) at the transmitter. First, we formulate the hybrid precoder design as an energy efficiency (EE) maximization problem, considering both continuous and discrete phase shift networks for implementing the analog precoder, and covering both fully and partially connected architectures. Since the EE optimization problem is nonconvex, it is in general difficult to solve. To make it tractable, we apply a closed-form tight upper bound to approximate the ergodic rate. We then develop an efficient algorithm to obtain the fully digital precoders, and building on this, two efficient algorithmic solutions that compute the hybrid precoders for the fully and partially connected architectures, respectively. Simulation results show that the proposed approaches achieve significant EE gains over existing baselines, especially when a discrete phase shift network is employed for analog precoding.
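Schematically, and with notation assumed here for illustration rather than taken from the paper, the design problem is a fractional program: maximize the sum of (bounded) ergodic rates over total transmit-plus-circuit power, subject to the hybrid structure and the phase constraints of the analog network,
\[
\max_{\mathbf{W}_{\mathrm{RF}},\,\{\mathbf{w}_{\mathrm{BB},k}\}}\ \mathrm{EE}=\frac{\sum_{k=1}^{K}\bar{R}_k}{\frac{1}{\eta}\sum_{k=1}^{K}\lVert\mathbf{W}_{\mathrm{RF}}\mathbf{w}_{\mathrm{BB},k}\rVert^{2}+P_{\mathrm{c}}},\qquad \left|[\mathbf{W}_{\mathrm{RF}}]_{m,n}\right|=1,\ \arg[\mathbf{W}_{\mathrm{RF}}]_{m,n}\in\mathcal{F},
\]
where $\bar{R}_k$ is the closed-form upper bound on user $k$'s ergodic rate, $\eta$ is the power amplifier efficiency, $P_{\mathrm{c}}$ is the circuit power consumption, and $\mathcal{F}$ is either the continuum of phases or a discrete phase set, depending on the phase shift network.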
As the training of giant dense models hits the boundary of the availability and capability of today's hardware resources, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to quality-equivalent dense models. These training cost savings have been demonstrated for encoder-decoder models (prior work) and now extend to a 5x saving for autoregressive language models (this work, along with parallel explorations). However, due to the much larger model size and unique architecture, fast MoE model inference remains challenging and unsolved, limiting its practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution, part of the DeepSpeed library, that includes novel MoE architecture designs and model compression techniques reducing MoE model size by up to 3.7x, and a highly optimized inference system providing 7.3x better latency and cost than existing MoE inference solutions. DeepSpeed-MoE offers unprecedented scale and efficiency for serving massive MoE models, with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems open a promising path in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely practical.
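For readers unfamiliar with the architecture, a textbook top-1 gated MoE layer (a generic sketch, not DeepSpeed-MoE's optimized kernels or its specific architecture designs) looks like:

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Generic top-1 gated mixture-of-experts layer: each token is routed
    to a single expert FFN chosen by a learned gate, so per-token compute
    stays roughly constant as the number of experts (and parameters) grows."""
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)
        top_p, top_idx = scores.max(dim=-1)    # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():                     # run each expert on its tokens
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out
```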
Lattice Boltzmann schemes rely on enlarging the size of the target problem in order to solve PDEs in a highly parallelizable, efficient, kinetic-like fashion, split into a collision phase and a stream phase. Despite its well-known computational advantages, this structure is not well suited to constructing a rigorous notion of consistency with respect to the target equations or a precise notion of stability. To alleviate these shortcomings and introduce a rigorous framework, we demonstrate that any lattice Boltzmann scheme can be rewritten as a corresponding multi-step Finite Difference scheme on the conserved variables. This is achieved by devising a suitable formalism based on operators, commutative algebra, and polynomials. The notion of consistency of the corresponding Finite Difference scheme then allows one to invoke the Lax-Richtmyer theorem in the case of linear lattice Boltzmann schemes. Moreover, we show that the frequently used von Neumann-like stability analysis for lattice Boltzmann schemes corresponds exactly to the von Neumann stability analysis of their Finite Difference counterparts. More generally, the usual tools for the analysis of Finite Difference schemes are now readily available for studying lattice Boltzmann schemes. Their relevance is verified by means of numerical illustrations.
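A classical instance of this correspondence, stated here as an illustration (the paper's construction covers arbitrary schemes): for the D1Q2 scheme with lattice velocity $\lambda=\Delta x/\Delta t$, advection velocity $V$, and relaxation parameter $\omega=1$, eliminating the non-conserved moment shows that the conserved moment $\rho=f^{+}+f^{-}$ obeys
\[
\rho_j^{n+1}=\frac{1}{2}\left(1+\frac{V}{\lambda}\right)\rho_{j-1}^{n}+\frac{1}{2}\left(1-\frac{V}{\lambda}\right)\rho_{j+1}^{n},
\]
which is exactly the Lax-Friedrichs Finite Difference scheme, consistent with the advection equation $\partial_t\rho+V\partial_x\rho=0$; for general $\omega$ the elimination produces a multi-step scheme.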
This paper studies the single image super-resolution problem using adder neural networks (AdderNet). Compared with convolutional neural networks, AdderNet uses additions to calculate output features, thus avoiding the massive energy consumption of conventional multiplications. However, the existing success of AdderNet on large-scale image classification is hard to inherit directly for the image super-resolution task because of the different calculation paradigm. Specifically, the adder operation cannot easily learn the identity mapping, which is essential for image processing tasks. In addition, the functionality of high-pass filters cannot be ensured by AdderNet. To this end, we thoroughly analyze the relationship between the adder operation and the identity mapping, and insert shortcuts to enhance the performance of SR models using adder networks. We then develop a learnable power activation for adjusting the feature distribution and refining details. Experiments conducted on several benchmark models and datasets demonstrate that our image super-resolution models using AdderNet achieve comparable performance and visual quality to their CNN baselines, with an approximately 2$\times$ reduction in energy consumption.
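One plausible form of such a power activation (an assumption for illustration; the paper's exact parameterization may differ) is a per-channel learnable exponent applied to the feature magnitude:

```python
import torch
import torch.nn as nn

class LearnablePowerActivation(nn.Module):
    """Hypothetical learnable power activation: y = sign(x) * |x|^alpha
    with a per-channel exponent alpha that reshapes the feature
    distribution; alpha is initialized to 1 so the layer starts as the
    identity mapping."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):                      # x: (N, C, H, W)
        return torch.sign(x) * torch.abs(x).clamp_min(1e-8) ** self.alpha
```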
The use of object detection algorithms is becoming increasingly important in autonomous vehicles, and object detection at high accuracy and fast inference speed is essential for safe autonomous driving. A false positive (FP) resulting from mislocalization during autonomous driving can lead to fatal accidents and hinder safe and efficient driving. Therefore, a detection algorithm that can cope with mislocalization is required in autonomous driving applications. This paper proposes a method for improving detection accuracy while supporting real-time operation by modeling the bounding box (bbox) of YOLOv3, the most representative one-stage detector, with Gaussian parameters and redesigning the loss function. In addition, this paper proposes a method for predicting the localization uncertainty, which indicates the reliability of a bbox. By using the predicted localization uncertainty during the detection process, the proposed schemes can significantly reduce FPs and increase true positives (TPs), thereby improving accuracy. Compared to conventional YOLOv3, the proposed algorithm, Gaussian YOLOv3, improves the mean average precision (mAP) by 3.09 and 3.5 on the KITTI and Berkeley deep drive (BDD) datasets, respectively. On the same datasets, it also reduces FPs by 41.40% and 40.62%, and increases TPs by 7.26% and 4.3%, respectively. Despite these gains, the proposed algorithm remains capable of real-time detection at faster than 42 frames per second (fps).
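The core of the loss redesign can be sketched as a per-coordinate Gaussian negative log-likelihood (our own minimal sketch; the full Gaussian YOLOv3 loss also includes the usual objectness and class terms and scale-dependent weighting):

```python
import math
import torch

def gaussian_bbox_nll(mu, sigma, target, eps=1e-9):
    """Minimal sketch: each bbox coordinate (tx, ty, tw, th) is modeled
    as a Gaussian with predicted mean mu and variance sigma; minimizing
    the negative log-likelihood teaches sigma to express localization
    uncertainty, which can then down-weight unreliable detections."""
    nll = 0.5 * torch.log(2 * math.pi * sigma + eps) \
        + (target - mu) ** 2 / (2 * sigma + eps)
    return nll.sum()
```

At detection time, the predicted uncertainty can be folded into the scoring (e.g., multiplying the detection score by one minus the averaged uncertainty) so that poorly localized boxes are suppressed.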