亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Feature extraction and matching are the basic parts of many robotic vision tasks, such as 2D or 3D object detection, recognition, and registration. As known, 2D feature extraction and matching have already been achieved great success. Unfortunately, in the field of 3D, the current methods fail to support the extensive application of 3D LiDAR sensors in robotic vision tasks, due to the poor descriptiveness and inefficiency. To address this limitation, we propose a novel 3D feature representation method: Linear Keypoints representation for 3D LiDAR point cloud, called LinK3D. The novelty of LinK3D lies in that it fully considers the characteristics (such as the sparsity, and complexity of scenes) of LiDAR point clouds, and represents the keypoint with its robust neighbor keypoints, which provide strong distinction in the description of the keypoint. The proposed LinK3D has been evaluated on two public datasets (i.e., KITTI, Steven VLP16), and the experimental results show that our method greatly outperforms the state-of-the-art in matching performance. More importantly, LinK3D shows excellent real-time performance, faster than the sensor frame rate at 10 Hz of a typical rotating LiDAR sensor. LinK3D only takes an average of 32 milliseconds to extract features from the point cloud collected by a 64-beam LiDAR, and takes merely about 8 milliseconds to match two LiDAR scans when executed in a notebook with an Intel Core i7 @2.2 GHz processor. Moreover, our method can be widely extended to various 3D vision applications. In this paper, we apply the proposed LinK3D to the LiDAR odometry and place recognition task of LiDAR SLAM. The experimental results show that our method can improve the efficiency and accuracy of LiDAR SLAM system.

相關內容

As AI/ML models, including Large Language Models, continue to scale with massive datasets, so does their consumption of undeniably limited natural resources, and impact on society. In this collaboration between AI, Sustainability, HCI and legal researchers, we aim to enable a transition to sustainable AI development by enabling stakeholders across the AI value chain to assess and quantitfy the environmental and societal impact of AI. We present the ESG Digital and Green Index (DGI), which offers a dashboard for assessing a company's performance in achieving sustainability targets. This includes monitoring the efficiency and sustainable use of limited natural resources related to AI technologies (water, electricity, etc). It also addresses the societal and governance challenges related to AI. The DGI creates incentives for companies to align their pathway with the Sustainable Development Goals (SDGs). The value, challenges and limitations of our methodology and findings are discussed in the paper.

Graphic designers often get inspiration through the recombination of references. Our formative study (N=6) reveals that graphic designers focus on conceptual keywords during this process, and want support for discovering the keywords, expanding them, and exploring diverse recombination options of them, while still having room for their creativity. We propose CreativeConnect, a system with generative AI pipelines that helps users discover useful elements from the reference image using keywords, recommends relevant keywords, generates diverse recombination options with user-selected keywords, and shows recombinations as sketches with text descriptions. Our user study (N=16) showed that CreativeConnect helped users discover keywords from the reference and generate multiple ideas based on them, ultimately helping users produce more design ideas and higher self-reported creativity, compared to the baseline system without generative pipelines. While CreativeConnect was effective in ideation, we discussed how CreativeConnect can be extended to support other types of tasks in creativity support.

Despite the impressive results of arbitrary image-guided style transfer methods, text-driven image stylization has recently been proposed for transferring a natural image into a stylized one according to textual descriptions of the target style provided by the user. Unlike the previous image-to-image transfer approaches, text-guided stylization progress provides users with a more precise and intuitive way to express the desired style. However, the huge discrepancy between cross-modal inputs/outputs makes it challenging to conduct text-driven image stylization in a typical feed-forward CNN pipeline. In this paper, we present DiffStyler, a dual diffusion processing architecture to control the balance between the content and style of the diffused results. The cross-modal style information can be easily integrated as guidance during the diffusion process step-by-step. Furthermore, we propose a content image-based learnable noise on which the reverse denoising process is based, enabling the stylization results to better preserve the structure information of the content image. We validate the proposed DiffStyler beyond the baseline methods through extensive qualitative and quantitative experiments. Code is available at \url{//github.com/haha-lisa/Diffstyler}.

Image prefiltering with just noticeable distortion (JND) improves coding efficiency in a visual lossless way by filtering the perceptually redundant information prior to compression. However, real JND cannot be well modeled with inaccurate masking equations in traditional approaches or image-level subject tests in deep learning approaches. Thus, this paper proposes a fine-grained JND prefiltering dataset guided by image quality assessment for accurate block-level JND modeling. The dataset is constructed from decoded images to include coding effects and is also perceptually enhanced with block overlap and edge preservation. Furthermore, based on this dataset, we propose a lightweight JND prefiltering network, IQNet, which can be applied directly to different quantization cases with the same model and only needs 3K parameters. The experimental results show that the proposed approach to Versatile Video Coding could yield maximum/average bitrate savings of 41\%/15\% and 53\%/19\% for all-intra and low-delay P configurations, respectively, with negligible subjective quality loss. Our method demonstrates higher perceptual quality and a model size that is an order of magnitude smaller than previous deep learning methods.

Interpretability methods are developed to understand the working mechanisms of black-box models, which is crucial to their responsible deployment. Fulfilling this goal requires both that the explanations generated by these methods are correct and that people can easily and reliably understand them. While the former has been addressed in prior work, the latter is often overlooked, resulting in informal model understanding derived from a handful of local explanations. In this paper, we introduce explanation summary (ExSum), a mathematical framework for quantifying model understanding, and propose metrics for its quality assessment. On two domains, ExSum highlights various limitations in the current practice, helps develop accurate model understanding, and reveals easily overlooked properties of the model. We also connect understandability to other properties of explanations such as human alignment, robustness, and counterfactual minimality and plausibility.

Conventionally, spatiotemporal modeling network and its complexity are the two most concentrated research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of the complexity meanwhile efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two Height-Time and Width-Time planes, to capture the dynamics of video thoroughly. Secondly, our model is designed based on 2D CNN backbones and model complexity is well kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework and it can specialize to be existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance with 2D CNN's complexity.

Visual dialogue is a challenging task that needs to extract implicit information from both visual (image) and textual (dialogue history) contexts. Classical approaches pay more attention to the integration of the current question, vision knowledge and text knowledge, despising the heterogeneous semantic gaps between the cross-modal information. In the meantime, the concatenation operation has become de-facto standard to the cross-modal information fusion, which has a limited ability in information retrieval. In this paper, we propose a novel Knowledge-Bridge Graph Network (KBGN) model by using graph to bridge the cross-modal semantic relations between vision and text knowledge in fine granularity, as well as retrieving required knowledge via an adaptive information selection mode. Moreover, the reasoning clues for visual dialogue can be clearly drawn from intra-modal entities and inter-modal bridges. Experimental results on VisDial v1.0 and VisDial-Q datasets demonstrate that our model outperforms exiting models with state-of-the-art results.

The design of deep graph models still remains to be investigated and the crucial part is how to explore and exploit the knowledge from different hops of neighbors in an efficient way. In this paper, we propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of network; and the proposed graph convolutional network called AdaGCN~(AdaBoosting Graph Convolutional Network) has the ability to efficiently extract knowledge from high-order neighbors and integrate knowledge from different hops of neighbors into the network in an AdaBoost way. We also present the architectural difference between AdaGCN and existing graph convolutional methods to show the benefits of our proposal. Finally, extensive experiments demonstrate the state-of-the-art prediction performance and the computational advantage of our approach AdaGCN.

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.

Distant supervision can effectively label data for relation extraction, but suffers from the noise labeling problem. Recent works mainly perform soft bag-level noise reduction strategies to find the relatively better samples in a sentence bag, which is suboptimal compared with making a hard decision of false positive samples in sentence level. In this paper, we introduce an adversarial learning framework, which we named DSGAN, to learn a sentence-level true-positive generator. Inspired by Generative Adversarial Networks, we regard the positive samples generated by the generator as the negative samples to train the discriminator. The optimal generator is obtained until the discrimination ability of the discriminator has the greatest decline. We adopt the generator to filter distant supervision training dataset and redistribute the false positive instances into the negative set, in which way to provide a cleaned dataset for relation classification. The experimental results show that the proposed strategy significantly improves the performance of distant supervision relation extraction comparing to state-of-the-art systems.

北京阿比特科技有限公司