Recent advances in Multi-modal Large Language Models (MLLMs), such as LLaVA-series models, are driven by massive machine-generated instruction-following data tuning. Such automatic instruction collection pipelines, however, inadvertently introduce significant variability in data quality. This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment, to compress this vast corpus of machine-generated multimodal instructions to a compact and high-quality form: (i) For human preference alignment, we have collected a machine-generated multimodal instruction dataset and established a comprehensive set of both subjective and objective criteria to guide the data quality assessment critically from human experts. By doing so, a reward model was trained on the annotated dataset to internalize the nuanced human understanding of instruction alignment. (ii) For LLM preference alignment, given the instruction selected by the reward model, we propose leveraging the inner LLM used in MLLM to align the writing style of visual instructions with that of the inner LLM itself, resulting in LLM-aligned instruction improvement. Extensive experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%. Impressively, by aggressively reducing the total training sample size from 158k to 14k (9$\times$ smaller), our model consistently outperforms its full-size dataset counterpart across various MLLM benchmarks. Our project is available at //github.com/DCDmllm/Align2LLaVA.

Related content

MoDELS

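The ACM/IEEE 23rd International Conference on Model Driven Engineering Languages and Systems is the premier conference series for model-driven software and systems engineering, organized with the support of ACM SIGSOFT and IEEE TCSE. Since 1998, MODELS has covered all aspects of modeling, from languages and methods to tools and applications. MODELS attendees come from diverse backgrounds, including researchers, academics, engineers, and industry professionals. MODELS 2019 is a forum where participants can exchange cutting-edge research results and innovative practical experience around modeling and model-driven software and systems. This year's edition will give the modeling community the opportunity to further advance the foundations of modeling and to present innovative applications of modeling in emerging areas such as cyber-physical systems, embedded systems, socio-technical systems, cloud computing, big data, machine learning, security, and open source, as well as sustainability. Official website: · MoDELS · Controller · Generalization theory · Generative model ·

November 6, 2024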

Gaussian Deja-vu: Creating Controllable 3D Gaussian Head-Avatars with Enhanced Generalization and Personalization Abilities

Peizhi Yan,Rabab Ward,Qiang Tang,Shan Du

from arxiv, 11 pages, Accepted by WACV 2025 in Round 1

Recent advancements in 3D Gaussian Splatting (3DGS) have unlocked significant potential for modeling 3D head avatars, providing greater flexibility than mesh-based methods and more efficient rendering compared to NeRF-based approaches. Despite these advancements, the creation of controllable 3DGS-based head avatars remains time-intensive, often requiring tens of minutes to hours. To expedite this process, we here introduce the "Gaussian Deja-vu" framework, which first obtains a generalized model of the head avatar and then personalizes the result. The generalized model is trained on large 2D (synthetic and real) image datasets. This model provides a well-initialized 3D Gaussian head that is further refined using a monocular video to achieve the personalized head avatar. For personalizing, we propose learnable expression-aware rectification blendmaps to correct the initial 3D Gaussians, ensuring rapid convergence without the reliance on neural networks. Experiments demonstrate that the proposed method meets its objectives. It outperforms state-of-the-art 3D Gaussian head avatars in terms of photorealistic quality as well as reduces training time consumption to at least a quarter of the existing methods, producing the avatar in minutes.

Question answering · Visual question answering · MoDELS · Performer · Score ·

November 6, 2024

VQA$^2$:Visual Question Answering for Video Quality Assessment

Ziheng Jia,Zicheng Zhang,Jiaying Qian,Haoning Wu,Wei Sun,Chunyi Li,Xiaohong Liu,Weisi Lin,Guangtao Zhai,Xiongkuo Min

from arxiv, 10 pages 3 figures

The advent and proliferation of large multi-modal models (LMMs) have introduced a new paradigm to video-related computer vision fields, including training and inference methods based on visual question answering (VQA). These methods enable models to handle multiple downstream tasks robustly. Video Quality Assessment (VQA), a classic field in low-level visual quality evaluation, originally focused on quantitative video quality scoring. However, driven by advances in LMMs, it is now evolving towards more comprehensive visual quality understanding tasks. Visual question answering has recently brought significant improvements to low-level visual evaluation in the image domain. However, related work is almost nonexistent in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset, the first visual question answering instruction dataset that focuses entirely on video quality assessment, and based on it, we propose the VQA2 series models. The VQA2 Instruction Dataset consists of three stages and covers various video types, containing 157,735 instruction question-answer pairs, including both manually annotated and synthetic data. We conduct extensive experiments on both video quality scoring and video quality understanding tasks. Results demonstrate that the VQA2 series models achieve state-of-the-art (SOTA) performance in quality scoring tasks, and their performance in visual quality question answering surpasses the renowned GPT-4o. Additionally, our final model, the VQA2-Assistant, performs well across both scoring and question-answering tasks, validating its versatility.

Curvature · Total return · PCA · Mutually independent · Identically distributed ·

November 6, 2024

Zero-Coupon Treasury Yield Curve with VIX as Stochastic Volatility

Jihyun Park,Andrey Sarantsev

from arxiv, 13 pages, 2 figures. Keywords: total returns, Ornstein-Uhlenbeck process, ergodic Markov processes, autoregression, long-term stability, stationary distribution, principal component analysis

We apply Principal Component Analysis for zero-coupon Treasury bonds to get level, slope, and curvature series. We model these as autoregressions of order 1, and analyze their innovations. For slope, but not for level and curvature, dividing these innovations by the Volatility Index VIX made for Standard & Poor 500 makes them closer to independent identically distributed normal. We state and prove stability results for bond returns based on this observation. We chose zero-coupon as opposed to classic coupon Treasury bonds because it is much easier to compute returns for these.
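The autoregression-and-normalization step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the least-squares fit, the function names `ar1_innovations` and `normalize_by_vix`, the alignment of each innovation with the VIX value of the same period, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def ar1_innovations(series):
    """Fit x[t] = a + b * x[t-1] + e[t] by least squares; return (a, b, e)."""
    x_prev, x_next = series[:-1], series[1:]
    design = np.column_stack([np.ones_like(x_prev), x_prev])
    (a, b), *_ = np.linalg.lstsq(design, x_next, rcond=None)
    innovations = x_next - (a + b * x_prev)
    return a, b, innovations

def normalize_by_vix(innovations, vix):
    """Divide each innovation by the VIX value of the same period (assumed alignment)."""
    return innovations / vix

# Synthetic stand-ins for a slope series and VIX (hypothetical data, not real market data).
rng = np.random.default_rng(0)
n = 500
vix = 15.0 + 5.0 * np.abs(rng.standard_normal(n))
slope = np.zeros(n)
for t in range(1, n):
    # Innovation scale proportional to VIX, mimicking stochastic volatility.
    slope[t] = 0.1 + 0.9 * slope[t - 1] + 0.01 * vix[t] * rng.standard_normal()

a, b, e = ar1_innovations(slope)
z = normalize_by_vix(e, vix[1:])  # innovations[k] belongs to period k + 1
```

The normalized series `z` is what one would then test for being close to independent identically distributed normal, for example with a normality test and an autocorrelation check.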

3D · MoDELS · ASSETS · Approximation · INFORMS ·

November 5, 2024

Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation

Xianghui Yang,Huiwen Shi,Bowen Zhang,Fan Yang,Jiacheng Wang,Hongxu Zhao,Xinhai Liu,Xinzhou Wang,Qingxiang Lin,Jiaao Yu,Lifu Wang,Zhuo Chen,Sicong Liu,Yuhong Liu,Yong Yang,Di Wang,Jie Jiang,Chunchao Guo

from arxiv, Technical Report; 3D Generation

While 3D generative models have greatly improved artists' workflows, the existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address this issue, we propose a two-stage approach named Hunyuan3D-1.0, including a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset given the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by the multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Our framework involves the text-to-image model, i.e., Hunyuan-DiT, making it a unified framework to support both text- and image-conditioned 3D generation. Our standard version has 3x more parameters than our lite version and other existing models. Our Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.

Generalization theory · Identifiable · Networking · Performer · Precision/Accuracy ·

November 5, 2024

An Interpretable Generalization Mechanism for Accurately Detecting Anomaly and Identifying Networking Intrusion Techniques

Hao-Ting Pai,Yu-Hsuan Kang,Wen-Cheng Chung

Recent advancements in Intrusion Detection Systems (IDS), integrating Explainable AI (XAI) methodologies, have led to notable improvements in system performance via precise feature selection. However, a thorough understanding of cyber-attacks requires inherently explainable decision-making processes within IDS. In this paper, we present the Interpretable Generalization Mechanism (IG), poised to revolutionize IDS capabilities. IG discerns coherent patterns, making it interpretable in distinguishing between normal and anomalous network traffic. Further, the synthesis of coherent patterns sheds light on intricate intrusion pathways, providing essential insights for cybersecurity forensics. In experiments on the real-world datasets NSL-KDD, UNSW-NB15, and UKM-IDS20, IG is accurate even at a low training-to-test ratio. At a 10%-to-90% ratio, IG achieves Precision (PRE)=0.93, Recall (REC)=0.94, and Area Under Curve (AUC)=0.94 in NSL-KDD; PRE=0.98, REC=0.99, and AUC=0.99 in UNSW-NB15; and PRE=0.98, REC=0.98, and AUC=0.99 in UKM-IDS20. Notably, in UNSW-NB15, IG achieves REC=1.0 and at least PRE=0.98 from a 40%-to-60% ratio onward; in UKM-IDS20, IG achieves REC=1.0 and at least PRE=0.88 from a 20%-to-80% ratio onward. Importantly, in UKM-IDS20, IG successfully identifies all three anomalous instances without prior exposure, demonstrating its generalization capabilities. These results and inferences are reproducible. In sum, IG showcases superior generalization by consistently performing well across diverse datasets and training-to-test ratios (from 10%-to-90% to 90%-to-10%), and excels in identifying novel anomalies without prior exposure. Its interpretability is enhanced by coherent evidence that accurately distinguishes both normal and anomalous activities, significantly improving detection accuracy and reducing false alarms, thereby strengthening IDS reliability and trustworthiness.

3D · MoDELS · ASSETS · Approximation · INFORMS ·

November 4, 2024

Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation

While 3D generative models have greatly improved artists' workflows, the existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address this issue, we propose a two-stage approach named Hunyuan3D-1.0, including a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset given the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by the multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Extensive experimental results demonstrate the effectiveness of Hunyuan3D-1.0 in generating high-quality 3D assets. Our framework involves the text-to-image model, i.e., Hunyuan-DiT, making it a unified framework to support both text- and image-conditioned 3D generation. Our standard version has $10\times$ more parameters than our lite version and other existing models. Our Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.

Estimation/Estimator · Category · Discretization · Optimizer · Factorized ·

November 4, 2024

On a Non-Uniform $\alpha$-Robust IMEX-L1 Mixed FEM for Time-Fractional PIDEs

Lok Pati Tripathi,Aditi Tomar,Amiya K. Pani

A non-uniform implicit-explicit L1 mixed finite element method (IMEX-L1-MFEM) is investigated for a class of time-fractional partial integro-differential equations (PIDEs) with space-time dependent coefficients and non-self-adjoint elliptic part. The proposed fully discrete method combines an IMEX-L1 method on a graded mesh in the temporal variable with a mixed finite element method in spatial variables. The focus of the study is to analyze stability results and to establish optimal error estimates, up to a logarithmic factor, for both the solution and the flux in $L^2$-norm when the initial data $u_0\in H_0^1(\Omega)\cap H^2(\Omega)$. Additionally, an error estimate in $L^\infty$-norm is derived for 2D problems. All the derived estimates and bounds in this article remain valid as $\alpha\to 1^{-}$, where $\alpha$ is the order of the Caputo fractional derivative. Finally, the results of several numerical experiments confirm our theoretical findings.
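To make the temporal ingredient concrete, the sketch below shows the standard L1 approximation of a Caputo derivative of order $\alpha\in(0,1)$ on a graded mesh. It covers only that one building block: the IMEX treatment and the mixed finite element discretization from the abstract are not shown. The mesh formula $t_n = T(n/N)^\gamma$ is a common choice assumed here, and the function names `graded_mesh` and `l1_caputo` are hypothetical.

```python
import math
import numpy as np

def graded_mesh(T, N, gamma):
    """Nodes t_n = T * (n / N) ** gamma for n = 0..N; gamma = 1 gives a uniform mesh."""
    return T * (np.arange(N + 1) / N) ** gamma

def l1_caputo(u, t, alpha):
    """L1 approximation of the Caputo derivative of order alpha at nodes t[1:].

    u is replaced by its piecewise-linear interpolant on the mesh, and the
    kernel (t_n - s)^(-alpha) is integrated exactly over each cell.
    """
    N = len(t) - 1
    slopes = np.diff(u) / np.diff(t)  # derivative of the interpolant on each cell
    out = np.empty(N)
    for n in range(1, N + 1):
        # Kernel integral over [t_{k-1}, t_k] for k = 1..n, up to the factor 1 / (1 - alpha).
        weights = (t[n] - t[:n]) ** (1 - alpha) - (t[n] - t[1:n + 1]) ** (1 - alpha)
        out[n - 1] = np.dot(weights, slopes[:n]) / math.gamma(2 - alpha)
    return out

# The L1 formula is exact for linear functions: for u(t) = t the Caputo
# derivative is t^(1 - alpha) / Gamma(2 - alpha).
alpha = 0.6
t = graded_mesh(T=1.0, N=20, gamma=2.0)
approx = l1_caputo(t.copy(), t, alpha)
exact = t[1:] ** (1 - alpha) / math.gamma(2 - alpha)
```

Grading the mesh towards $t=0$ concentrates nodes where solutions of time-fractional problems typically have a weak singularity, which is the usual motivation for a non-uniform mesh in this setting.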

MoDELS · Robot · Controller · Learning · Full ·

November 2, 2024

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black,Noah Brown,Danny Driess,Adnan Esmail,Michael Equi,Chelsea Finn,Niccolo Fusai,Lachy Groom,Karol Hausman,Brian Ichter,Szymon Jakubczak,Tim Jones,Liyiming Ke,Sergey Levine,Adrian Li-Bell,Mohith Mothukuri,Suraj Nair,Karl Pertsch,Lucy Xiaoyang Shi,James Tanner,Quan Vuong,Anna Walling,Haohuan Wang,Ury Zhilinsky

from arxiv, See project website for videos: //physicalintelligence.company/blog/pi0

Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.

MoDELS · INFORMS · Factorized · Recommender systems · Pruning ·

February 20, 2021

$FM^2$: Field-matrixed Factorization Machines for Recommender Systems

Yang Sun,Junwei Pan,Alex Zhang,Aaron Flores

from arxiv, In Proceedings of the Web Conference 2021 (WWW 2021), April 19-23, 2021, Ljubljana, Slovenia. 10 pages

Click-through rate (CTR) prediction plays a critical role in recommender systems and online advertising. The data used in these applications are multi-field categorical data, where each feature belongs to one field. Field information has proved to be important, and several works consider fields in their models. In this paper, we propose a novel approach to model the field information effectively and efficiently. The proposed approach is a direct improvement of FwFM, and is named Field-matrixed Factorization Machines (FmFM, or $FM^2$). We also propose a new explanation of FM and FwFM within the FmFM framework, and compare it with FFM. Besides pruning the cross terms, our model supports field-specific variable dimensions of embedding vectors, which acts as soft pruning. We also propose an efficient way to minimize the dimension while keeping the model performance. The FmFM model can also be optimized further by caching the intermediate vectors, and it only takes thousands of floating-point operations (FLOPs) to make a prediction. Our experimental results show that it can outperform FFM, which is more complex. The FmFM model's performance is also comparable to DNN models, which require many more FLOPs at runtime.
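A rough sketch of a field-matrixed pairwise interaction may help here. The abstract does not state the formula, so the form used below, a per-field-pair matrix placed between the two embedding vectors, is an assumption, as are the function name `fmfm_score`, the dictionary layout, and all toy values. The sketch also assumes one active feature per field.

```python
import numpy as np

def fmfm_score(active, field_of, w0, w, V, M):
    """Score one sample whose active (one-hot) features are listed in `active`.

    field_of[i] : field index of feature i
    V[i]        : embedding of feature i; its length may differ from field to field
    M[(f, g)]   : matrix of shape (dim_f, dim_g) for each field pair with f < g
    Assumes at most one active feature per field.
    """
    score = w0 + sum(w[i] for i in active)
    for a, i in enumerate(active):
        for j in active[a + 1:]:
            # Order the pair so the lookup key always has the smaller field first.
            p, q = (i, j) if field_of[i] < field_of[j] else (j, i)
            matrix = M[(field_of[p], field_of[q])]
            score += float(V[p] @ matrix @ V[q])
    return score

# Toy example: three fields with embedding sizes 2, 3, and 1 (hypothetical values).
field_of = {0: 0, 1: 1, 2: 2}
V = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0, 0.0]), 2: np.array([2.0])}
M = {
    (0, 1): np.array([[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]),
    (0, 2): np.array([[0.5], [0.0]]),
    (1, 2): np.array([[0.0], [1.0], [0.0]]),
}
w = {0: 0.1, 1: 0.2, 2: 0.3}
score = fmfm_score([0, 1, 2], field_of, w0=0.5, w=w, V=V, M=M)
```

Because each matrix is rectangular, the fields can use embeddings of different lengths, which is how this form accommodates field-specific dimensions. When every matrix is a square identity, the pairwise term reduces to a plain dot product between embeddings.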

Graph convolutional neural network/Graph convolutional network · Graph · Graph convolution · Graph convolutional network · Learned ·

March 30, 2020

L^2-GCN: Layer-Wise and Learned Efficient Training of Graph Convolutional Networks

Yuning You,Tianlong Chen,Zhangyang Wang,Yang Shen

from arxiv, CVPR 2020

Graph convolution networks (GCN) are increasingly popular in many applications, yet remain notoriously hard to train over large graph datasets. They need to compute node representations recursively from their neighbors. Current GCN training algorithms suffer from either high computational costs that grow exponentially with the number of layers, or high memory usage for loading the entire graph and node embeddings. In this paper, we propose a novel efficient layer-wise training framework for GCN (L-GCN) that disentangles feature aggregation and feature transformation during training, hence greatly reducing time and memory complexities. We present a theoretical analysis of L-GCN under the graph isomorphism framework, showing that L-GCN leads to GCNs as powerful as those produced by the more costly conventional training algorithm, under mild conditions. We further propose L^2-GCN, which learns a controller for each layer that can automatically adjust the training epochs per layer in L-GCN. Experiments show that L-GCN is faster than state-of-the-art methods by at least an order of magnitude, with consistent memory usage that does not depend on dataset size, while maintaining comparable prediction performance. With the learned controller, L^2-GCN can further cut the training time in half. Our codes are available at //github.com/Shen-Lab/L2-GCN.
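The idea of separating aggregation from transformation can be illustrated with a small NumPy sketch. It is not the authors' implementation (their code is at the repository linked above). It assumes a symmetric-normalized adjacency, a throwaway softmax classifier per layer, and full-batch gradient descent, and it omits the learned controller of L^2-GCN. All function names and the toy graph are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_layer(H_agg, Y, hidden, epochs, lr, rng):
    """Train the transformation W with a temporary classifier C on fixed aggregated features."""
    n, d = H_agg.shape
    W = rng.standard_normal((d, hidden)) * 0.1
    C = rng.standard_normal((hidden, Y.shape[1])) * 0.1
    losses = []
    for _ in range(epochs):
        Z = H_agg @ W
        H = np.maximum(Z, 0.0)                      # ReLU
        P = softmax(H @ C)
        losses.append(float(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))))
        G = (P - Y) / n                             # cross-entropy gradient w.r.t. logits
        grad_C = H.T @ G
        grad_W = H_agg.T @ ((G @ C.T) * (Z > 0))
        W -= lr * grad_W
        C -= lr * grad_C
    return W, losses                                # C is discarded after training

def layerwise_train(A, X, Y, hidden_sizes, epochs=300, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    A_norm = normalize_adjacency(A)
    H, weights, histories = X, [], []
    for hidden in hidden_sizes:
        H_agg = A_norm @ H                          # aggregation: once per layer, outside the loop
        W, losses = train_layer(H_agg, Y, hidden, epochs, lr, rng)
        H = np.maximum(H_agg @ W, 0.0)              # frozen output feeds the next layer
        weights.append(W)
        histories.append(losses)
    return weights, histories

# Toy graph: two triangles, one class per triangle, one-hot node features.
A = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                A[i, j] = 1.0
X = np.eye(6)
Y = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)
weights, histories = layerwise_train(A, X, Y, hidden_sizes=[8, 8])
```

In this sketch the graph is touched only in the single `A_norm @ H` product per layer, so the inner training loop works on a fixed feature matrix and could be mini-batched without loading neighbors.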