
Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by $25.3\%$. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark consisting of $\sim$224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines.
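
The abstract does not spell out how the efficient vision module is built; as a rough illustration of the general idea of querying fine-grained features across several images with a small set of learned queries, here is a minimal, hypothetical PyTorch sketch (module names, query count, and dimensions are invented for illustration, not PRIMA's actual design).

```python
# Hypothetical sketch: a query-based module that attends over patch features
# from several images at once (names and dimensions are illustrative only).
import torch
import torch.nn as nn

class CrossImageQuery(nn.Module):
    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats):
        # image_feats: (batch, num_images, num_patches, dim)
        b, n, p, d = image_feats.shape
        kv = image_feats.reshape(b, n * p, d)          # pool patches from all images
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, kv, kv)                  # queries gather cross-image evidence
        return out                                     # (batch, num_queries, dim)

feats = torch.randn(2, 4, 196, 256)    # 2 samples, 4 images each, 14x14 patches
print(CrossImageQuery()(feats).shape)  # torch.Size([2, 32, 256])
```

Passing a fixed, small set of query vectors to the language model, instead of every patch token from every image, is one way such a design could cut compute while keeping fine-grained access to all images.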

Related Content

The ACM/IEEE 23rd International Conference on Model Driven Engineering Languages and Systems (MODELS) is the premier conference series on model-driven software and systems engineering, organized with the support of ACM SIGSOFT and IEEE TCSE. Since 1998, MODELS has covered all aspects of modeling, from languages and methods to tools and applications. Its participants come from diverse backgrounds, including researchers, academics, engineers, and industry practitioners. MODELS 2019 is a forum in which participants can exchange cutting-edge research results and innovative practical experience around modeling and model-driven software and systems. This year's edition gives the modeling community an opportunity to further advance the foundations of modeling and to present innovative applications of modeling in emerging areas such as cyber-physical systems, embedded systems, socio-technical systems, cloud computing, big data, machine learning, security, open source, and sustainability.
December 27, 2024
 DeepSeek-AI,Aixin Liu,Bei Feng,Bing Xue,Bingxuan Wang,Bochao Wu,Chengda Lu,Chenggang Zhao,Chengqi Deng,Chenyu Zhang,Chong Ruan,Damai Dai,Daya Guo,Dejian Yang,Deli Chen,Dongjie Ji,Erhang Li,Fangyun Lin,Fucong Dai,Fuli Luo,Guangbo Hao,Guanting Chen,Guowei Li,H. Zhang,Han Bao,Hanwei Xu,Haocheng Wang,Haowei Zhang,Honghui Ding,Huajian Xin,Huazuo Gao,Hui Li,Hui Qu,J. L. Cai,Jian Liang,Jianzhong Guo,Jiaqi Ni,Jiashi Li,Jiawei Wang,Jin Chen,Jingchang Chen,Jingyang Yuan,Junjie Qiu,Junlong Li,Junxiao Song,Kai Dong,Kai Hu,Kaige Gao,Kang Guan,Kexin Huang,Kuai Yu,Lean Wang,Lecong Zhang,Lei Xu,Leyi Xia,Liang Zhao,Litong Wang,Liyue Zhang,Meng Li,Miaojun Wang,Mingchuan Zhang,Minghua Zhang,Minghui Tang,Mingming Li,Ning Tian,Panpan Huang,Peiyi Wang,Peng Zhang,Qiancheng Wang,Qihao Zhu,Qinyu Chen,Qiushi Du,R. J. Chen,R. L. Jin,Ruiqi Ge,Ruisong Zhang,Ruizhe Pan,Runji Wang,Runxin Xu,Ruoyu Zhang,Ruyi Chen,S. S. Li,Shanghao Lu,Shangyan Zhou,Shanhuang Chen,Shaoqing Wu,Shengfeng Ye,Shengfeng Ye,Shirong Ma,Shiyu Wang,Shuang Zhou,Shuiping Yu,Shunfeng Zhou,Shuting Pan,T. Wang,Tao Yun,Tian Pei,Tianyu Sun,W. L. Xiao,Wangding Zeng,Wanjia Zhao,Wei An,Wen Liu,Wenfeng Liang,Wenjun Gao,Wenqin Yu,Wentao Zhang,X. Q. Li,Xiangyue Jin,Xianzu Wang,Xiao Bi,Xiaodong Liu,Xiaohan Wang,Xiaojin Shen,Xiaokang Chen,Xiaokang Zhang,Xiaosha Chen,Xiaotao Nie,Xiaowen Sun,Xiaoxiang Wang,Xin Cheng,Xin Liu,Xin Xie,Xingchao Liu,Xingkai Yu,Xinnan Song,Xinxia Shan,Xinyi Zhou,Xinyu Yang,Xinyuan Li,Xuecheng Su,Xuheng Lin,Y. K. Li,Y. Q. Wang,Y. X. Wei,Y. X. Zhu,Yang Zhang,Yanhong Xu,Yanhong Xu,Yanping Huang,Yao Li,Yao Zhao,Yaofeng Sun,Yaohui Li,Yaohui Wang,Yi Yu,Yi Zheng,Yichao Zhang,Yifan Shi,Yiliang Xiong,Ying He,Ying Tang,Yishi Piao,Yisong Wang,Yixuan Tan,Yiyang Ma,Yiyuan Liu,Yongqiang Guo,Yu Wu,Yuan Ou,Yuchen Zhu,Yuduan Wang,Yue Gong,Yuheng Zou,Yujia He,Yukun Zha,Yunfan Xiong,Yunxian Ma,Yuting Yan,Yuxiang Luo,Yuxiang You,Yuxuan Liu,Yuyang Zhou,Z. F. Wu,Z. Z. Ren,Zehui Ren,Zhangli Sha,Zhe Fu,Zhean Xu,Zhen Huang,Zhen Zhang,Zhenda Xie,Zhengyan Zhang,Zhewen Hao,Zhibin Gou,Zhicheng Ma,Zhigang Yan,Zhihong Shao,Zhipeng Xu,Zhiyu Wu,Zhongyu Zhang,Zhuoshu Li,Zihui Gu,Zijia Zhu,Zijun Liu,Zilin Li,Ziwei Xie,Ziyang Song,Ziyi Gao,Zizheng Pan

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable: throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at //github.com/deepseek-ai/DeepSeek-V3.
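
As a loose illustration of what an auxiliary-loss-free load-balancing strategy can look like, the sketch below adds a per-expert bias to the routing scores only for top-k selection and nudges it toward an even expert load; the update rule and constants are illustrative and are not DeepSeek-V3's exact recipe.

```python
# Hedged sketch of an auxiliary-loss-free balancing idea: a per-expert bias is
# added to routing scores only for top-k selection and nudged up/down depending
# on how loaded the expert was in the last batch. Constants and the update rule
# are illustrative, not the paper's exact formulation.
import torch

def route(scores, bias, k=2, gamma=1e-3):
    # scores: (tokens, experts) raw affinities; bias: (experts,) balancing bias
    topk = torch.topk(scores + bias, k, dim=-1).indices        # selection uses the bias
    load = torch.bincount(topk.flatten(), minlength=scores.shape[1]).float()
    target = load.mean()
    bias = bias - gamma * torch.sign(load - target)            # overloaded experts get pushed down
    gates = torch.gather(scores, 1, topk).softmax(dim=-1)      # gate values ignore the bias
    return topk, gates, bias

scores = torch.randn(16, 8)
topk, gates, bias = route(scores, torch.zeros(8))
print(topk.shape, gates.shape, bias)
```

Because balancing is handled by the bias update rather than an extra loss term, the training objective itself stays focused on language modeling quality.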

Rapid advances in Vision Transformers (ViTs) have refreshed the state-of-the-art performance on various vision tasks, overshadowing conventional CNN-based models. This has ignited a line of recent research striking back from the CNN side, showing that pure CNN models can achieve performance as good as ViT models when carefully tuned. While encouraging, designing such high-performance CNN models is challenging and requires non-trivial prior knowledge of network design. To this end, a novel framework termed Mathematical Architecture Design for Deep CNN (DeepMAD) is proposed to design high-performance CNN models in a principled way. In DeepMAD, a CNN network is modeled as an information processing system whose expressiveness and effectiveness can be analytically formulated in terms of its structural parameters. A constrained mathematical programming (MP) problem is then proposed to optimize these structural parameters. The MP problem can be easily solved by off-the-shelf MP solvers on CPUs with a small memory footprint. In addition, DeepMAD is a pure mathematical framework: no GPU or training data is required during network design. The superiority of DeepMAD is validated on multiple large-scale computer vision benchmark datasets. Notably, on ImageNet-1k, using only conventional convolutional layers, DeepMAD achieves 0.7% and 1.5% higher top-1 accuracy than ConvNeXt and Swin at the Tiny level, and 0.8% and 0.9% higher at the Small level.
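
To make the "network design as constrained mathematical programming" idea concrete, here is an illustrative sketch that picks per-stage widths by maximizing a simple expressiveness proxy under a parameter budget with an off-the-shelf CPU solver; the objective, constraint, and numbers are stand-ins, not DeepMAD's actual formulation.

```python
# Illustrative-only sketch of design-by-mathematical-programming: choose stage
# widths that maximize a simple expressiveness proxy under a parameter budget,
# using an off-the-shelf CPU solver. Objective and constraint are stand-ins.
import numpy as np
from scipy.optimize import minimize

depths = np.array([2, 2, 6, 2])            # fixed depths for four stages
budget = 25e6                              # parameter budget

def params(widths):                        # rough 3x3-conv parameter count per stage
    return np.sum(depths * 9 * widths ** 2)

def neg_expressiveness(widths):            # proxy: deeper and wider stages score higher
    return -np.sum(depths * np.log(widths))

res = minimize(neg_expressiveness, x0=np.full(4, 64.0), method="SLSQP",
               bounds=[(16, 1024)] * 4,
               constraints=[{"type": "ineq", "fun": lambda w: budget - params(w)}])
print(np.round(res.x))                     # suggested stage widths
```

The point of the sketch is only that the whole search runs on a CPU in seconds, with no GPU or training data, which is the workflow the abstract describes.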

Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, properly describing both global image structure and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model that performs a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector-quantized features, followed by coarse-to-fine gating to produce the image output. During this multi-scale representation learning stage, additional input conditions such as text, scene graphs, or image layouts can be further exploited. Thus, Frido can also be applied to conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditional and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, and scene-graph-to-image to label-to-image. More specifically, we achieve state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at //github.com/davidhalladay/Frido.
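
A minimal, hypothetical sketch of the coarse-to-fine pyramid idea follows: each scale is denoised in turn, conditioned on an upsampled version of the previous, coarser scale. The denoiser and feature shapes are placeholders and do not reproduce Frido's networks or quantized feature spaces.

```python
# Toy sketch of coarse-to-fine generation over a feature pyramid: each scale is
# "denoised" conditioned on the upsampled previous scale. The denoiser below is
# a stand-in, not a real diffusion reverse process.
import torch
import torch.nn.functional as F

def denoise(x, cond, steps=10):
    # placeholder reverse process: gradually blend noise toward the condition
    for _ in range(steps):
        x = 0.9 * x + 0.1 * cond
    return x

def coarse_to_fine(shapes=((1, 4, 8, 8), (1, 4, 16, 16), (1, 4, 32, 32))):
    prev = torch.zeros(shapes[0])
    outputs = []
    for shape in shapes:                                   # coarse -> fine
        cond = F.interpolate(prev, size=shape[-2:], mode="nearest")
        prev = denoise(torch.randn(shape), cond)
        outputs.append(prev)
    return outputs

print([o.shape for o in coarse_to_fine()])
```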

Existing knowledge graph (KG) embedding models have primarily focused on static KGs. However, real-world KGs do not remain static, but rather evolve and grow in tandem with the development of KG applications. Consequently, new facts and previously unseen entities and relations continually emerge, necessitating an embedding model that can quickly learn and transfer new knowledge through growth. Motivated by this, we delve into an expanding field of KG embedding in this paper, i.e., lifelong KG embedding. We consider knowledge transfer and retention of the learning on growing snapshots of a KG without having to learn embeddings from scratch. The proposed model includes a masked KG autoencoder for embedding learning and update, with an embedding transfer strategy to inject the learned knowledge into the new entity and relation embeddings, and an embedding regularization method to avoid catastrophic forgetting. To investigate the impacts of different aspects of KG growth, we construct four datasets to evaluate the performance of lifelong KG embedding. Experimental results show that the proposed model outperforms the state-of-the-art inductive and lifelong embedding baselines.
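
One ingredient named above, the embedding regularization that guards against catastrophic forgetting, can be sketched as an anchor term that penalizes drift of previously learned entity embeddings while the new snapshot is fitted; the loss terms, weights, and shapes below are illustrative only.

```python
# Hedged sketch of an embedding regularizer for lifelong KG embedding: fit the
# new snapshot while keeping old entities' embeddings close to their previous
# values. Loss terms and the weighting are illustrative.
import torch

def lifelong_loss(scores, labels, emb, old_emb, old_ids, lam=0.1):
    # scores/labels: link-prediction logits and targets for the new snapshot
    task = torch.nn.functional.binary_cross_entropy_with_logits(scores, labels)
    # retention: penalize drift of embeddings for previously seen entities
    retain = ((emb[old_ids] - old_emb[old_ids]) ** 2).sum(dim=-1).mean()
    return task + lam * retain

emb = torch.nn.Parameter(torch.randn(100, 64))   # 100 entities after growth
old_emb = emb.detach().clone()                   # snapshot of earlier embeddings
scores, labels = torch.randn(32), torch.randint(0, 2, (32,)).float()
loss = lifelong_loss(scores, labels, emb, old_emb, torch.arange(80))
loss.backward()
print(float(loss))
```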

We present CoDEx, a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. In terms of scope, CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false. To characterize CoDEx, we contribute thorough empirical analyses and benchmarking experiments. First, we analyze each CoDEx dataset in terms of logical relation patterns. Next, we report baseline link prediction and triple classification results on CoDEx for five extensively tuned embedding models. Finally, we differentiate CoDEx from the popular FB15K-237 knowledge graph completion dataset by showing that CoDEx covers more diverse and interpretable content, and is a more difficult link prediction benchmark. Data, code, and pretrained models are available at //bit.ly/2EPbrJs.
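
For readers unfamiliar with the triple classification protocol used with hard negatives, the toy sketch below thresholds a scoring function on verified-true triples versus plausible-but-false ones; the random scorer is a stand-in for any trained embedding model, and the threshold would normally be tuned per relation on validation data.

```python
# Illustrative sketch of triple classification against hard negatives: a scoring
# function is thresholded, and accuracy is computed over positives and verified
# false triples. The scorer here is random, purely to show the evaluation loop.
import numpy as np

rng = np.random.default_rng(0)

def score(triples):                        # stand-in for a trained embedding model
    return rng.normal(size=len(triples))

def triple_classification(pos, neg, threshold=0.0):
    s_pos, s_neg = score(pos), score(neg)
    correct = (s_pos > threshold).sum() + (s_neg <= threshold).sum()
    return correct / (len(pos) + len(neg))

pos = [("h", "r", "t")] * 500              # verified-true triples
neg = [("h", "r", "t'")] * 500             # plausible-but-false (hard) negatives
print(f"accuracy: {triple_classification(pos, neg):.3f}")
```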

We present CURL: Contrastive Unsupervised Representations for Reinforcement Learning. CURL extracts high-level features from raw pixels using contrastive learning and performs off-policy control on top of the extracted features. CURL outperforms prior pixel-based methods, both model-based and model-free, on complex tasks in the DeepMind Control Suite and Atari Games showing 1.9x and 1.6x performance gains at the 100K environment and interaction steps benchmarks respectively. On the DeepMind Control Suite, CURL is the first image-based algorithm to nearly match the sample-efficiency and performance of methods that use state-based features.
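
The contrastive objective behind CURL-style methods is an InfoNCE loss over two augmented views of the same observation, with keys produced by a momentum encoder; a minimal sketch with placeholder features and a bilinear similarity is shown below (encoders and shapes are stand-ins).

```python
# Minimal sketch of a CURL-style contrastive objective: anchor and positive
# features from two crops of the same observation are matched with InfoNCE.
# The features and the similarity matrix W here are placeholders.
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, W):
    # anchors: (B, d) query features; positives: (B, d) key features
    logits = anchors @ W @ positives.t()          # bilinear similarity, (B, B)
    labels = torch.arange(anchors.size(0))        # the diagonal holds the positives
    return F.cross_entropy(logits, labels)

B, d = 32, 50
W = torch.nn.Parameter(torch.eye(d))
q, k = torch.randn(B, d), torch.randn(B, d)
loss = info_nce(q, k.detach(), W)                 # keys come from a momentum encoder
loss.backward()
print(float(loss))
```

The RL agent then runs its usual off-policy updates on top of the anchor features, so representation learning and control share the same encoder.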

We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM). Given an input text with masked tokens, we rely on conventional masks to learn inter-relations between corrupted tokens and context via autoencoding, and pseudo masks to learn intra-relations between masked spans via partially autoregressive modeling. With well-designed position embeddings and self-attention masks, the context encodings are reused to avoid redundant computation. Moreover, conventional masks used for autoencoding provide global masking information, so that all the position embeddings are accessible in partially autoregressive language modeling. In addition, the two tasks pre-train a unified language model as a bidirectional encoder and a sequence-to-sequence decoder, respectively. Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks across several widely used benchmarks.
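
To illustrate the two masking views in a toy setting, the sketch below builds a self-attention mask in which pseudo-masked spans are predicted span by span, each blocked from attending to later spans, while all other positions stay mutually visible; the sequence length, indices, and span layout are invented for illustration.

```python
# Toy sketch of a span-level attention mask for partially autoregressive
# prediction: span i may attend to the context and to earlier spans, but not to
# later spans. Conventional [MASK] positions would keep full visibility.
import torch

def span_attention_mask(seq_len, spans):
    # spans: list of (start, end) index ranges to be predicted autoregressively
    allowed = torch.ones(seq_len, seq_len, dtype=torch.bool)
    for i, (s_i, e_i) in enumerate(spans):
        for j, (s_j, e_j) in enumerate(spans):
            if j > i:                     # span i cannot peek at later spans
                allowed[s_i:e_i, s_j:e_j] = False
    return allowed

mask = span_attention_mask(10, spans=[(2, 4), (7, 9)])
print(mask.int())
```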

Convolutional neural networks (CNNs) have shown dramatic improvements in single image super-resolution (SISR) by using large-scale external samples. Despite their remarkable performance on external datasets, they cannot exploit the internal information within a specific image. Another problem is that they are applicable only to the specific data conditions under which they were supervised; for instance, the low-resolution (LR) image should be a "bicubic"-downsampled, noise-free version of a high-resolution (HR) one. To address both issues, zero-shot super-resolution (ZSSR) has been proposed for flexible internal learning. However, it requires thousands of gradient updates, i.e., a long inference time. In this paper, we present Meta-Transfer Learning for Zero-Shot Super-Resolution (MZSR), which leverages ZSSR. Specifically, it is based on finding a generic initial parameter set that is suitable for internal learning. Thus, we can exploit both external and internal information, where a single gradient update can already yield considerable results. With our method, the network can quickly adapt to a given image condition. In this respect, our method can be applied to a large spectrum of image conditions within a fast adaptation process.
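
The test-time procedure the abstract describes, starting from a meta-learned initialization and taking one or a few gradient steps on pairs built from the test image itself, can be sketched as follows; the tiny network, loss, and downscaling below are placeholders, not the paper's architecture or training setup.

```python
# Hedged sketch of test-time internal adaptation: downscale the test LR image,
# take a gradient step that teaches the (meta-initialized) network to restore
# it, then apply the adapted network to the upscaled LR image.
import torch
import torch.nn.functional as F

net = torch.nn.Conv2d(3, 3, 3, padding=1)        # stand-in for the SR network
# net.load_state_dict(meta_learned_init)         # would come from meta-training

def adapt_and_upscale(lr_img, scale=2, steps=1, lr=0.02):
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    son = F.interpolate(lr_img, scale_factor=1 / scale, mode="bicubic")
    son_up = F.interpolate(son, size=lr_img.shape[-2:], mode="bicubic")
    for _ in range(steps):                       # internal learning on the LR image itself
        loss = F.l1_loss(net(son_up), lr_img)
        opt.zero_grad()
        loss.backward()
        opt.step()
    up = F.interpolate(lr_img, scale_factor=scale, mode="bicubic")
    return net(up)                               # SR estimate after adaptation

print(adapt_and_upscale(torch.rand(1, 3, 32, 32)).shape)
```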

Generative Adversarial Networks (GANs) have recently achieved impressive results for many real-world applications, and many GAN variants have emerged with improvements in sample quality and training stability. However, they have not been well visualized or understood. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to object concepts using a segmentation-based network dissection method. Then, we quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. We examine the contextual relationship between these units and their surroundings by inserting the discovered object concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in a scene. We provide open source interpretation tools to help researchers and practitioners better understand their GAN models.
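
The unit-level intervention can be sketched with a forward hook that zeroes one unit's feature map in an intermediate layer and compares the output with and without the ablation; the toy generator and the simple difference measure below stand in for the paper's segmentation-based measurement of causal effect.

```python
# Sketch of the intervention mechanic: zero one unit's activation map in an
# intermediate layer via a forward hook and compare outputs. The two-layer
# "generator" is a toy stand-in.
import torch
import torch.nn as nn

gen = nn.Sequential(nn.ConvTranspose2d(16, 8, 4, 2, 1), nn.ReLU(),
                    nn.ConvTranspose2d(8, 3, 4, 2, 1))

def ablate_unit(module, unit):
    def hook(_, __, output):
        output[:, unit] = 0.0            # intervention: remove this unit's activation
        return output
    return module.register_forward_hook(hook)

z = torch.randn(1, 16, 8, 8)
baseline = gen(z)
handle = ablate_unit(gen[0], unit=3)     # ablate unit 3 of the first layer
ablated = gen(z)
handle.remove()
print(float((baseline - ablated).abs().mean()))   # crude proxy for the causal effect
```

In the paper the effect is quantified by how much of a segmented object class disappears from the output; the mean absolute difference printed here is only a stand-in for that measurement.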

We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector, by means of a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground-truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns the relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both the PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower-resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.
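
A global activation module in the spirit described above can be sketched as global pooling followed by a small bottleneck that produces per-channel weights for re-scaling the detection features; the block, reduction ratio, and dimensions below are illustrative, not the paper's exact design.

```python
# Hedged sketch of a global activation block: squeeze spatial information into a
# per-channel descriptor, then re-weight channels. Dimensions are illustrative.
import torch
import torch.nn as nn

class GlobalActivation(nn.Module):
    def __init__(self, channels=512, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze: global spatial average
        return x * w[:, :, None, None]           # excite: channel-wise re-weighting

feat = torch.randn(2, 512, 38, 38)               # e.g. a VGG16 conv feature map
print(GlobalActivation()(feat).shape)
```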
