
Reconfigurable intelligent surfaces (RISs) have gained much traction due to their potential to manipulate the propagation environment via nearly passive reconfigurable elements. In our previous work, we proposed and analyzed a beyond-diagonal RIS (BD-RIS) model, which is not limited to traditional diagonal phase-shift matrices and unifies different RIS modes/architectures. In this paper, we create a new branch of BD-RIS supporting a multi-sector mode. A multi-sector BD-RIS is modeled as multiple antennas connected to a multi-port group-connected reconfigurable impedance network. More specifically, the antennas are divided into $L$ ($L \ge 2$) sectors and arranged as a polygonal prism, with each sector covering $1/L$ of the full space. Different from the recently introduced concept of the intelligent omni-surface (or simultaneously transmitting and reflecting RIS), the multi-sector BD-RIS not only achieves full-space coverage but also offers significant performance gains thanks to the highly directional beam of each sector. We derive the constraint of the multi-sector BD-RIS and the corresponding channel model, taking into account the relationship between antenna beamwidth and gain. With the proposed model, we first derive the scaling law of the received signal power for a multi-sector BD-RIS-assisted single-user system. We then propose efficient beamforming design algorithms to maximize the sum-rate of the multi-sector BD-RIS-assisted multiuser system. Simulation results verify the effectiveness of the proposed design and demonstrate the performance enhancement of the proposed multi-sector BD-RIS.
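
A minimal NumPy sketch of the core constraint behind group-connected BD-RIS architectures: each group's scattering block must be unitary (lossless), so the full matrix is block-diagonal with unitary blocks rather than a plain diagonal phase-shift matrix. The grouping (each group tying together $L$ antennas, one per sector) and all sizes are my illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

L = 4   # number of sectors
G = 8   # number of groups; assumed cell-wise grouping: each group is L x L

def random_unitary(n):
    """Draw a Haar-random n x n unitary via QR decomposition."""
    A = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    Q, R = np.linalg.qr(A)
    d = np.diag(R)
    return Q * (d / np.abs(d))  # fix the column-phase ambiguity

# Block-diagonal scattering matrix: one unitary block per group.
Theta = np.zeros((G * L, G * L), dtype=complex)
for g in range(G):
    Theta[g*L:(g+1)*L, g*L:(g+1)*L] = random_unitary(L)

# Verify the lossless (unitary) constraint Theta^H Theta = I per group.
assert np.allclose(Theta.conj().T @ Theta, np.eye(G * L), atol=1e-10)

# Toy received power through the surface for random incident/outgoing channels.
h_in = (rng.standard_normal(G * L) + 1j * rng.standard_normal(G * L)) / np.sqrt(2)
h_out = (rng.standard_normal(G * L) + 1j * rng.standard_normal(G * L)) / np.sqrt(2)
print(f"received power (toy): {np.abs(h_out.conj() @ Theta @ h_in)**2:.3f}")
```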

Related Content

Dense retrieval has shown promise in first-stage retrieval when trained on in-domain labeled datasets. However, previous studies have found that dense retrieval generalizes poorly to unseen domains due to its weak modeling of the domain-invariant and interpretable feature (i.e., the matching signal between two texts, which is the essence of information retrieval). In this paper, we propose a novel method, called BERM, that improves the generalization of dense retrieval by capturing the matching signal. Fine-grained expression and query-oriented saliency are two properties of the matching signal. Thus, in BERM, each passage is segmented into multiple units, and two unit-level requirements are imposed on the representation as training constraints to obtain an effective matching signal: semantic unit balance and essential matching unit extractability. The unit-level view and balanced semantics make the representation express the text in a fine-grained manner. Essential matching unit extractability makes the passage representation sensitive to the given query, so that the pure matching information can be extracted from a passage containing complex context. Experiments on BEIR show that our method can be effectively combined with different dense retrieval training methods (vanilla, hard-negative mining, and knowledge distillation) to improve generalization without any additional inference overhead or target-domain data.
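
A hedged PyTorch sketch of the two unit-level ideas, as I read the abstract (not the paper's exact losses): "semantic unit balance" is rendered as an entropy penalty on unit-to-passage attention, and "essential matching unit extractability" as aligning the query with its best-matching unit. Unit size and all function names are my assumptions.

```python
import torch
import torch.nn.functional as F

def unit_embeddings(token_emb, unit_size=32):
    """Segment a passage's token embeddings [T, d] into units and mean-pool."""
    T, d = token_emb.shape
    n_units = (T + unit_size - 1) // unit_size
    pad = n_units * unit_size - T
    x = F.pad(token_emb, (0, 0, 0, pad)).view(n_units, unit_size, d)
    return x.mean(dim=1)                                  # [n_units, d]

def bermish_losses(query_emb, passage_tokens):
    units = unit_embeddings(passage_tokens)               # [U, d]
    passage_emb = passage_tokens.mean(dim=0)              # [d]

    # (1) balance: unit-to-passage attention should not collapse onto one
    # unit, so we penalize negative entropy of the attention distribution.
    attn = F.softmax(units @ passage_emb, dim=0)          # [U]
    balance_loss = (attn * attn.log()).sum()

    # (2) extractability: the query should match its best unit strongly.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), units, dim=-1)
    extract_loss = -sims.max()
    return balance_loss, extract_loss

q, p = torch.randn(768), torch.randn(200, 768)
print([v.item() for v in bermish_losses(q, p)])
```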

Inspired by the traditional partial differential equation (PDE) approach to image denoising, we propose a novel neural network architecture, referred to as NODE-ImgNet, that combines neural ordinary differential equations (NODEs) with convolutional neural network (CNN) blocks. NODE-ImgNet is intrinsically a PDE model in which the dynamical system is learned implicitly, without explicit specification of the PDE; this naturally circumvents the typical issues associated with introducing artifacts during the learning process. By invoking such a NODE structure, which can be viewed as a continuous variant of a residual network (ResNet) and inherits its advantages in image denoising, our model achieves enhanced accuracy and parameter efficiency. In particular, our model is consistently effective in different scenarios, including denoising gray and color images perturbed by Gaussian noise as well as real noisy images, and demonstrates superiority in learning from small image datasets.
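
A minimal sketch of the NODE-plus-CNN idea (my reading of the abstract, not the released NODE-ImgNet code): a small conv net defines the dynamics dx/dt = f(t, x), and integrating it with torchdiffeq's `odeint` replaces a stack of discrete residual blocks. Channel counts, solver, and the residual (noise-prediction) output are assumptions.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq

class ConvDynamics(nn.Module):
    """Convolutional right-hand side f(t, x) of the learned ODE."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, t, x):          # odeint expects f(t, x)
        return self.net(x)

class NodeDenoiser(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.lift = nn.Conv2d(1, channels, 3, padding=1)
        self.dynamics = ConvDynamics(channels)
        self.proj = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, noisy):
        x = self.lift(noisy)
        t = torch.tensor([0.0, 1.0])  # integrate the dynamics over [0, 1]
        x = odeint(self.dynamics, x, t, method="rk4",
                   options={"step_size": 0.25})[-1]
        return noisy - self.proj(x)   # predict and subtract the noise

model = NodeDenoiser()
print(model(torch.randn(2, 1, 32, 32)).shape)  # torch.Size([2, 1, 32, 32])
```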

Future sixth-generation (6G) networks are envisioned to provide both sensing and communication functionalities using densely deployed base stations (BSs) with massive antennas operating in the millimeter wave (mmWave) and terahertz (THz) bands. Due to the large number of antennas and the high frequency bands, sensing and communication will operate within the near-field region, making conventional designs based on far-field channel models inapplicable. This paper studies a near-field multiple-input multiple-output (MIMO) radar sensing system in which transceivers with massive antennas aim to localize multiple near-field targets in three-dimensional (3D) space. In particular, we adopt a general wavefront propagation model that considers the exact spherical wavefront with both channel phase and amplitude variations over different antennas. Besides, we consider general transmit signal waveforms and unknown, cluttered environments. Under this setup, the unknown parameters to estimate include the 3D coordinates and the complex reflection coefficients of the multiple targets, as well as the noise and interference covariance matrix. Accordingly, we derive the Cramér-Rao bound (CRB) for estimating the target coordinates and reflection coefficients. Next, to facilitate practical localization, we propose an efficient estimator based on 3D approximate cyclic optimization (3D-ACO), obtained following the maximum likelihood (ML) criterion. Finally, numerical results show that accounting for the exact antenna-varying channel amplitudes yields a more accurate CRB than prior works assuming constant channel amplitudes across antennas, especially when the targets are close to the transceivers. It is also shown that the proposed estimator achieves localization performance close to the derived CRB, validating its effectiveness.
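
A NumPy sketch of the modeling point the abstract emphasizes: under the exact spherical-wavefront model, each antenna sees its own distance-dependent phase and amplitude, whereas the far-field plane-wave model assumes a common amplitude and linear phase. Carrier, array size, and target position are illustrative assumptions.

```python
import numpy as np

c, fc = 3e8, 100e9            # speed of light; 100 GHz carrier (assumed)
lam = c / fc

# Uniform linear array along x, centered at the origin.
N, d = 256, lam / 2
ant = np.stack([(np.arange(N) - (N - 1) / 2) * d,
                np.zeros(N), np.zeros(N)], axis=1)      # [N, 3]

target = np.array([1.0, 2.0, 0.5])                      # 3D target position (m)

# Exact near-field model: per-antenna distance gives varying phase AND amplitude.
r = np.linalg.norm(target - ant, axis=1)                # [N]
a_near = (1.0 / r) * np.exp(-1j * 2 * np.pi * r / lam)

# Far-field approximation: common amplitude, plane-wave (linear) phase.
r0 = np.linalg.norm(target)
u = target / r0
a_far = (1.0 / r0) * np.exp(-1j * 2 * np.pi * (r0 - ant @ u) / lam)

corr = np.abs(np.vdot(a_near, a_far)) / (np.linalg.norm(a_near) * np.linalg.norm(a_far))
print(f"near/far steering-vector correlation: {corr:.4f}")  # drops near the array
```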

Action knowledge involves the understanding of textual, visual, and temporal aspects of actions. We introduce the Action Dynamics Benchmark (ActionBench), containing two carefully designed probing tasks, Action Antonym and Video Reversal, which target the multimodal alignment capabilities and temporal understanding skills of a model, respectively. Despite recent video-language models' (VidLMs) impressive performance on various benchmark tasks, our diagnostic tasks reveal their surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition as a shortcut for action understanding. To remedy this, we propose a novel framework, Paxion, along with a new Discriminative Video Dynamics Modeling (DVDM) objective. The Paxion framework utilizes a Knowledge Patcher network to encode new action knowledge and a Knowledge Fuser component to integrate the Patcher into frozen VidLMs without compromising their existing capabilities. Because the widely used Video-Text Contrastive (VTC) loss is limited for learning action knowledge, we introduce the DVDM objective to train the Knowledge Patcher. DVDM forces the model to encode the correlation between the action text and the correct ordering of video frames. Our extensive analyses show that Paxion and DVDM together effectively fill the gap in action knowledge understanding (~50% to 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks.
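
A hedged sketch of the video-reversal half of a DVDM-style objective (my simplification from the abstract, not the Paxion release): the action text should match correctly ordered frames better than temporally reversed ones. The GRU pooler is my choice to make the video embedding order-sensitive (a mean pool would be order-invariant); all sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderSensitivePooler(nn.Module):
    """Pool per-frame features in a way that depends on frame order."""
    def __init__(self, dim=512):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, frames):                    # [B, T, d]
        _, h = self.gru(frames)
        return F.normalize(h.squeeze(0), dim=-1)  # [B, d]

def video_reversal_loss(pooler, frames, text_emb, temp=0.07):
    text = F.normalize(text_emb, dim=-1)
    v_fwd = pooler(frames)                        # correct temporal order
    v_rev = pooler(frames.flip(dims=[1]))         # reversed order (negative)
    logits = torch.stack([(text * v_fwd).sum(-1),
                          (text * v_rev).sum(-1)], dim=1) / temp  # [B, 2]
    # class 0 = forward order is the positive for every sample
    return F.cross_entropy(logits, torch.zeros(frames.size(0), dtype=torch.long))

pooler = OrderSensitivePooler()
loss = video_reversal_loss(pooler, torch.randn(4, 8, 512), torch.randn(4, 512))
print(loss.item())
```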

The reconfigurable intelligent surface (RIS) is an emerging technology that changes how wireless networks are perceived; its potential benefits and applications are therefore under intense research and investigation. In this letter, we focus on electromagnetically consistent models for RISs, building on a recently proposed model based on mutually coupled loaded wire dipoles. While existing related research focuses on free-space wireless channels, thereby ignoring interactions between the RIS and scattering objects present in the propagation environment, we introduce an RIS-aided channel model that is applicable to more realistic scenarios, where the scattering objects are modeled as loaded wire dipoles. By adjusting the parameters of the wire dipoles, the properties of general natural and engineered material objects can be modeled. Based on this model, we introduce a provably convergent and efficient iterative algorithm that jointly optimizes the RIS and transmitter configurations to maximize the system sum-rate. Extensive numerical results show the net performance improvement provided by the proposed method compared with existing optimization algorithms.
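
A sketch of the multiport impedance form commonly used in this line of mutually-coupled-dipole RIS models (a standard expression in the literature; the letter's exact matrices may differ): the end-to-end channel depends on the tunable reactive loads through H = Z_rt - Z_rs (Z_ss + Z_L)^{-1} Z_st. Random impedance matrices stand in for the actual dipole geometry.

```python
import numpy as np

rng = np.random.default_rng(1)
Nt, Nr, Ns = 4, 4, 32   # transmit / receive / RIS dipole counts (assumed)

def cmat(m, n):
    return rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

Z_rt = cmat(Nr, Nt)               # direct Tx -> Rx mutual impedances
Z_rs = cmat(Nr, Ns)               # RIS -> Rx
Z_st = cmat(Ns, Nt)               # Tx -> RIS
Z_ss = cmat(Ns, Ns)
Z_ss = (Z_ss + Z_ss.T) / 2        # reciprocity: the RIS block is symmetric

def end_to_end(x_loads):
    """Channel for tunable reactive loads z_l = j*x_l on each dipole."""
    Z_L = np.diag(1j * x_loads)
    return Z_rt - Z_rs @ np.linalg.solve(Z_ss + Z_L, Z_st)

x = rng.uniform(-100, 100, Ns)    # load reactances in ohms (illustrative)
H = end_to_end(x)
print("channel gain ||H||_F^2 =", np.linalg.norm(H) ** 2)
```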

The composed image retrieval (CIR) task aims to retrieve the desired target image for a given multimodal query, i.e., a reference image with its corresponding modification text. Existing efforts suffer from two key limitations: 1) they ignore the multi-faceted query-target matching factors; and 2) they ignore the potential unlabeled reference-target image pairs in existing benchmark datasets. Addressing these limitations is non-trivial due to two challenges: 1) how to effectively model the multi-faceted matching factors in a latent way without direct supervision signals; and 2) how to fully utilize the potential unlabeled reference-target image pairs to improve the generalization ability of the CIR model. To address these challenges, we first propose a muLtI-faceted Matching Network (LIMN), which consists of three key modules: a multi-grained image/text encoder, latent factor-oriented feature aggregation, and query-target matching modeling. Thereafter, we design an iterative dual self-training paradigm to further enhance LIMN by fully utilizing the potential unlabeled reference-target image pairs in a semi-supervised manner; we denote the resulting model as LIMN+. Extensive experiments on three real-world datasets, FashionIQ, Shoes, and Birds-to-Words, show that our proposed method significantly surpasses the state-of-the-art baselines.
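
A generic self-training skeleton for exploiting unlabeled reference-target pairs, as I read the abstract (LIMN+'s actual dual scheme alternates two models with its own filtering rules). `StubCIRModel` and its methods are hypothetical stand-ins for a trained CIR model.

```python
import random

class StubCIRModel:
    def pseudo_label(self, ref_img, tgt_img):
        """Return a hypothesised modification text and a confidence score."""
        return "change the color to red", random.random()

    def train_on(self, triplets):
        pass  # placeholder for a real fine-tuning step

def self_training_round(model, labeled, unlabeled, conf_threshold=0.9):
    pseudo = [(r, text, t)
              for r, t in unlabeled
              for text, score in [model.pseudo_label(r, t)]
              if score >= conf_threshold]       # keep only confident pairs
    model.train_on(labeled + pseudo)            # retrain on the enlarged set
    return len(pseudo)

random.seed(0)
model = StubCIRModel()
n = self_training_round(model, labeled=[],
                        unlabeled=[(f"ref{i}", f"tgt{i}") for i in range(100)])
print(f"{n} pseudo-labeled pairs kept this round")
```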

This paper presents a pixel selection method for compact image representation based on superpixel segmentation and tensor completion. Our method divides the image into several regions that capture important textures or semantics and selects a representative pixel from each region to store. We experiment with different criteria for choosing the representative pixel and find that the centroid pixel performs the best. We also propose two smooth tensor completion algorithms that can effectively reconstruct different types of images from the selected pixels. Our experiments show that our superpixel-based method achieves better results than uniform sampling for various missing ratios.
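
A sketch of the pixel-selection step as the abstract describes it: segment the image into superpixels and keep one representative pixel per region, where the "centroid pixel" (the criterion the paper finds best) is taken here as the in-region pixel nearest the region's mean coordinates. SLIC parameters and the test image are illustrative.

```python
import numpy as np
from skimage import data
from skimage.segmentation import slic

img = data.astronaut()                          # [H, W, 3] example image
labels = slic(img, n_segments=200, compactness=10, start_label=0)

H, W = labels.shape
mask = np.zeros((H, W), dtype=bool)
ys, xs = np.mgrid[0:H, 0:W]

for r in range(labels.max() + 1):
    region = labels == r
    ry, rx = ys[region], xs[region]
    cy, cx = ry.mean(), rx.mean()
    k = np.argmin((ry - cy) ** 2 + (rx - cx) ** 2)  # nearest in-region pixel
    mask[ry[k], rx[k]] = True

kept = np.where(mask[..., None], img, 0)        # sparse input for completion
print(f"stored {mask.sum()} of {H * W} pixels "
      f"({100 * mask.sum() / (H * W):.2f}%)")
```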

3D LiDAR-based single object tracking (SOT) has gained increasing attention as it plays a crucial role in 3D applications such as autonomous driving. The central problem is how to learn a target-aware representation from sparse and incomplete point clouds. In this paper, we propose a novel Correlation Pyramid Network (CorpNet) with a unified encoder and a motion-factorized decoder. Specifically, the encoder introduces multi-level self-attention and cross-attention in its main branch to enrich the template and search-region features and to realize their fusion and interaction, respectively. Additionally, considering the sparsity of point clouds, we design a lateral correlation pyramid structure for the encoder that keeps as many points as possible by integrating hierarchical correlated features. The output features of the search region from the encoder can be fed directly into the decoder for predicting target locations without any extra matcher. Moreover, in the decoder of CorpNet, we design a motion-factorized head to explicitly learn the different movement patterns of the up (z) axis and the x-y plane. Extensive experiments on two commonly used datasets show that CorpNet achieves state-of-the-art results while running in real time.
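
A hedged PyTorch sketch of the "motion-factorized head" idea (not the released CorpNet code): predict the x-y planar offset and the up-axis (z) offset with separate branches, since the two follow different movement patterns. The yaw branch and all feature sizes are my assumptions.

```python
import torch
import torch.nn as nn

class MotionFactorizedHead(nn.Module):
    def __init__(self, in_dim=256, hidden=128):
        super().__init__()
        def branch(out_dim):
            return nn.Sequential(nn.Linear(in_dim, hidden),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(hidden, out_dim))
        self.xy_branch = branch(2)     # planar offset (dx, dy)
        self.z_branch = branch(1)      # up-axis offset (dz), learned separately
        self.yaw_branch = branch(1)    # heading change (common in 3D SOT)

    def forward(self, search_feat):    # [B, in_dim] pooled search-region feature
        return torch.cat([self.xy_branch(search_feat),
                          self.z_branch(search_feat),
                          self.yaw_branch(search_feat)], dim=-1)  # [B, 4]

head = MotionFactorizedHead()
print(head(torch.randn(8, 256)).shape)  # torch.Size([8, 4])
```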

Deep learning has revolutionized speech recognition, image recognition, and natural language processing since 2010, each involving a single modality in the input signal. However, many applications in artificial intelligence involve more than one modality. It is therefore of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, a technical review of the models and learning methods for multimodal intelligence is provided. The main focus is the combination of vision and natural language, which has become an important area in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent work on multimodal deep learning from three new angles: learning multimodal representations, the fusion of multimodal signals at various levels, and multimodal applications. On multimodal representation learning, we review the key concept of embedding, which unifies multimodal signals into the same vector space and thus enables cross-modality signal processing. We also review the properties of the many types of embeddings constructed and learned for general downstream tasks. On multimodal fusion, this review focuses on special architectures for integrating the representations of unimodal signals for a particular task. On applications, selected areas of broad interest in the current literature are covered, including caption generation, text-to-image generation, and visual question answering. We believe this review can facilitate future studies in the emerging field of multimodal intelligence for the community.
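
A minimal sketch of the joint-embedding concept the review centers on: project each modality into the same vector space and train with a symmetric contrastive loss so that matched image-text pairs align. This is a generic recipe (CLIP-style), not tied to any single surveyed paper.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temp=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temp            # [B, B] cross-modal similarities
    targets = torch.arange(img.size(0))      # the i-th image matches the i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_alignment_loss(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```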

Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are widely used in NLP tasks to capture long-term and local dependencies, respectively. Attention mechanisms have recently attracted enormous interest due to their highly parallelizable computation, significantly shorter training time, and flexibility in modeling dependencies. We propose a novel attention mechanism in which the attention between elements from the input sequence(s) is directional and multi-dimensional (i.e., feature-wise). A lightweight neural network, the Directional Self-Attention Network (DiSAN), is then proposed to learn sentence embeddings based solely on the proposed attention, without any RNN/CNN structure. DiSAN is composed only of a directional self-attention with temporal order encoded, followed by a multi-dimensional attention that compresses the sequence into a vector representation. Despite its simple form, DiSAN outperforms complicated RNN models in both prediction quality and time efficiency. It achieves the best test accuracy among all sentence-encoding methods and improves the most recent best result by 1.02% on the Stanford Natural Language Inference (SNLI) dataset, and shows state-of-the-art test accuracy on the Stanford Sentiment Treebank (SST), Multi-Genre Natural Language Inference (MultiNLI), Sentences Involving Compositional Knowledge (SICK), Customer Review, MPQA, TREC question-type classification, and Subjectivity (SUBJ) datasets.
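
A hedged sketch of DiSAN's two ingredients as the abstract describes them, simplified from the paper's equations: (1) directionality via a forward mask so each token attends only to earlier tokens, and (2) multi-dimensional (feature-wise) attention that produces one score per feature dimension rather than a single scalar per token pair.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionalMultiDimAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                      # x: [B, T, d]
        B, T, d = x.shape
        # feature-wise scores: one logit per (query, key, dimension)
        scores = torch.tanh(self.q(x).unsqueeze(2) + self.k(x).unsqueeze(1)
                            + self.bias)       # [B, T, T, d]
        fwd_mask = torch.tril(torch.ones(T, T), diagonal=-1)  # strictly earlier
        scores = scores.masked_fill(fwd_mask.view(1, T, T, 1) == 0, float("-inf"))
        attn = F.softmax(scores, dim=2)        # normalize over keys, per dimension
        attn = torch.nan_to_num(attn)          # first token has no valid keys
        return (attn * x.unsqueeze(1)).sum(dim=2)   # [B, T, d]

layer = DirectionalMultiDimAttention(64)
print(layer(torch.randn(2, 10, 64)).shape)     # torch.Size([2, 10, 64])
```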
