In this work, we present a novel framework for camera relocation in autonomous vehicles, leveraging deep neural networks (DNN). While existing literature offers various DNN-based camera relocation methods, their deployment is hindered by their high computational demands during inference. In contrast, our approach addresses this challenge through edge cloud collaboration. Specifically, we strategically offload certain modules of the neural network to the server and evaluate the inference time of data frames under different network segmentation schemes to guide our offloading decisions. Our findings highlight the vital role of server-side offloading in DNN-based camera relocation for autonomous vehicles, and we also discuss the results of data fusion. Finally, we validate the effectiveness of our proposed framework through experimental evaluation.
In this correspondence, we propose to use an intelligent reflective surface (IRS) installed on unmanned aerial vehicle (UAV), referred to as aerial IRS (AIRS), for vehicular networks, where simultaneous wireless information and power transfer (SWIPT) receivers to concurrently allow information decoding (ID) and energy harvesting (EH) are equipped at the battery-limited vehicles. For efficiently supporting the multiple moving vehicles, we adopt rate-splitting multiple access (RSMA) technique. With the aim of maximizing the sum rate of vehicles, we jointly optimize trajectory and phase shift design of AIRS, transmit power and rate allocation for RSMA along with power splitting ratio for SWIPT implementation. Via simulations, the superior performances of the proposed algorithm are validated compared to the conventional partial optimizations.
We develop a versatile framework for statistical learning in non-stationary environments. In each time period, our approach applies a stability principle to select a look-back window that maximizes the utilization of historical data while keeping the cumulative bias within an acceptable range relative to the stochastic error. Our theory showcases the adaptability of this approach to unknown non-stationarity. The regret bound is minimax optimal up to logarithmic factors when the population losses are strongly convex, or Lipschitz only. At the heart of our analysis lie two novel components: a measure of similarity between functions and a segmentation technique for dividing the non-stationary data sequence into quasi-stationary pieces.
In this paper, we present a simultaneous target tracking and multi-user communications system realized by a full duplex holographic Multiple-Input Multiple-Output (MIMO) node equipped with Dynamic Metasurface Antennas (DMAs) at both its communication ends. Focusing on the near-field regime, we extend Fresnel's approximation to metasurfaces and devise a subspace tracking scheme with DMA-based hybrid Analog and Digital (A/D) reception as well as hybrid A/D transmission with a DMA for sum-rate maximization. The presented simulation results corroborate the efficiency of the proposed framework for various system parameters.
Contour-based instance segmentation has been actively studied, thanks to its flexibility and elegance in processing visual objects within complex backgrounds. In this work, we propose a novel deep network architecture, i.e., PolySnake, for generic contour-based instance segmentation. Motivated by the classic Snake algorithm, the proposed PolySnake achieves superior and robust segmentation performance with an iterative and progressive contour refinement strategy. Technically, PolySnake introduces a recurrent update operator to estimate the object contour iteratively. It maintains a single estimate of the contour that is progressively deformed toward the object boundary. At each iteration, PolySnake builds a semantic-rich representation for the current contour and feeds it to the recurrent operator for further contour adjustment. Through the iterative refinements, the contour progressively converges to a stable status that tightly encloses the object instance. Beyond the scope of general instance segmentation, extensive experiments are conducted to validate the effectiveness and generalizability of our PolySnake in two additional specific task scenarios, including scene text detection and lane detection. The results demonstrate that the proposed PolySnake outperforms the existing advanced methods on several multiple prevalent benchmarks across the three tasks. The codes and pre-trained models are available at //github.com/fh2019ustc/PolySnake
With the advancement of neural networks, there has been a notable increase, both in terms of quantity and variety, in research publications concerning the application of autoencoders to reduced-order models. We propose a polytopic autoencoder architecture that includes a lightweight nonlinear encoder, a convex combination decoder, and a smooth clustering network. Supported by several proofs, the model architecture ensures that all reconstructed states lie within a polytope, accompanied by a metric indicating the quality of the constructed polytopes, referred to as polytope error. Additionally, it offers a minimal number of convex coordinates for polytopic linear-parameter varying systems while achieving acceptable reconstruction errors compared to proper orthogonal decomposition (POD). To validate our proposed model, we conduct simulations involving two flow scenarios with the incompressible Navier-Stokes equation. Numerical results demonstrate the guaranteed properties of the model, low reconstruction errors compared to POD, and the improvement in error using a clustering network.
Recently, graph neural networks have been gaining a lot of attention to simulate dynamical systems due to their inductive nature leading to zero-shot generalizability. Similarly, physics-informed inductive biases in deep-learning frameworks have been shown to give superior performance in learning the dynamics of physical systems. There is a growing volume of literature that attempts to combine these two approaches. Here, we evaluate the performance of thirteen different graph neural networks, namely, Hamiltonian and Lagrangian graph neural networks, graph neural ODE, and their variants with explicit constraints and different architectures. We briefly explain the theoretical formulation highlighting the similarities and differences in the inductive biases and graph architecture of these systems. We evaluate these models on spring, pendulum, gravitational, and 3D deformable solid systems to compare the performance in terms of rollout error, conserved quantities such as energy and momentum, and generalizability to unseen system sizes. Our study demonstrates that GNNs with additional inductive biases, such as explicit constraints and decoupling of kinetic and potential energies, exhibit significantly enhanced performance. Further, all the physics-informed GNNs exhibit zero-shot generalizability to system sizes an order of magnitude larger than the training system, thus providing a promising route to simulate large-scale realistic systems.
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (ie, without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first (44.42% mIoU) position in the highly competitive ADE20K test server leaderboard.
Deep neural networks (DNNs) are successful in many computer vision tasks. However, the most accurate DNNs require millions of parameters and operations, making them energy, computation and memory intensive. This impedes the deployment of large DNNs in low-power devices with limited compute resources. Recent research improves DNN models by reducing the memory requirement, energy consumption, and number of operations without significantly decreasing the accuracy. This paper surveys the progress of low-power deep learning and computer vision, specifically in regards to inference, and discusses the methods for compacting and accelerating DNN models. The techniques can be divided into four major categories: (1) parameter quantization and pruning, (2) compressed convolutional filters and matrix factorization, (3) network architecture search, and (4) knowledge distillation. We analyze the accuracy, advantages, disadvantages, and potential solutions to the problems with the techniques in each category. We also discuss new evaluation metrics as a guideline for future research.
In this paper, we propose a novel multi-task learning architecture, which incorporates recent advances in attention mechanisms. Our approach, the Multi-Task Attention Network (MTAN), consists of a single shared network containing a global feature pool, together with task-specific soft-attention modules, which are trainable in an end-to-end manner. These attention modules allow for learning of task-specific features from the global pool, whilst simultaneously allowing for features to be shared across different tasks. The architecture can be built upon any feed-forward neural network, is simple to implement, and is parameter efficient. Experiments on the CityScapes dataset show that our method outperforms several baselines in both single-task and multi-task learning, and is also more robust to the various weighting schemes in the multi-task loss function. We further explore the effectiveness of our method through experiments over a range of task complexities, and show how our method scales well with task complexity compared to baselines.
In this paper, we propose the joint learning attention and recurrent neural network (RNN) models for multi-label classification. While approaches based on the use of either model exist (e.g., for the task of image captioning), training such existing network architectures typically require pre-defined label sequences. For multi-label classification, it would be desirable to have a robust inference process, so that the prediction error would not propagate and thus affect the performance. Our proposed model uniquely integrates attention and Long Short Term Memory (LSTM) models, which not only addresses the above problem but also allows one to identify visual objects of interests with varying sizes without the prior knowledge of particular label ordering. More importantly, label co-occurrence information can be jointly exploited by our LSTM model. Finally, by advancing the technique of beam search, prediction of multiple labels can be efficiently achieved by our proposed network model.