亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images up to 2,560$\times$2,560 resolution. Unlike existing work either struggle with high-resolution documents or give up the large language model thus vision or language ability constrained, our DocPedia directly processes visual input in the frequency domain rather than the pixel space. The unique characteristic enables DocPedia to capture a greater amount of visual and textual information using a limited number of visual tokens. To consistently enhance both perception and comprehension abilities of our model, we develop a dual-stage training strategy and enrich instructions/annotations of all training tasks covering multiple document types. Extensive quantitative and qualitative experiments conducted on various publicly available benchmarks confirm the mutual benefits of jointly learning perception and comprehension tasks. The results provide further evidence of the effectiveness and superior performance of our DocPedia over other methods.

相關內容

Entity alignment (EA), a pivotal process in integrating multi-source Knowledge Graphs (KGs), seeks to identify equivalent entity pairs across these graphs. Most existing approaches regard EA as a graph representation learning task, concentrating on enhancing graph encoders. However, the decoding process in EA - essential for effective operation and alignment accuracy - has received limited attention and remains tailored to specific datasets and model architectures, necessitating both entity and additional explicit relation embeddings. This specificity limits its applicability, particularly in GNN-based models. To address this gap, we introduce a novel, generalized, and efficient decoding approach for EA, relying solely on entity embeddings. Our method optimizes the decoding process by minimizing Dirichlet energy, leading to the gradient flow within the graph, to promote graph homophily. The discretization of the gradient flow produces a fast and scalable approach, termed Triple Feature Propagation (TFP). TFP innovatively channels gradient flow through three views: entity-to-entity, entity-to-relation, and relation-to-entity. This generalized gradient flow enables TFP to harness the multi-view structural information of KGs. Rigorous experimentation on diverse real-world datasets demonstrates that our approach significantly enhances various EA methods. Notably, the approach achieves these advancements with less than 6 seconds of additional computational time, establishing a new benchmark in efficiency and adaptability for future EA methods.

In this paper, we present ECL, a novel multi-modal dataset containing the textual and numerical data from corporate 10K filings and associated binary bankruptcy labels. Furthermore, we develop and critically evaluate several classical and neural bankruptcy prediction models using this dataset. Our findings suggest that the information contained in each data modality is complementary for bankruptcy prediction. We also see that the binary bankruptcy prediction target does not enable our models to distinguish next year bankruptcy from an unhealthy financial situation resulting in bankruptcy in later years. Finally, we explore the use of LLMs in the context of our task. We show how GPT-based models can be used to extract meaningful summaries from the textual data but zero-shot bankruptcy prediction results are poor. All resources required to access and update the dataset or replicate our experiments are available on github.com/henriarnoUG/ECL.

We present VOICE, a novel approach to science communication that connects large language models' (LLM) conversational capabilities with interactive exploratory visualization. VOICE introduces several innovative technical contributions that drive our conversational visualization framework. Our foundation is a pack-of-bots that can perform specific tasks, such as assigning tasks, extracting instructions, and generating coherent content. We employ fine-tuning and prompt engineering techniques to tailor bots' performance to their specific roles and accurately respond to user queries. Our interactive text-to-visualization method generates a flythrough sequence matching the content explanation. Besides, natural language interaction provides capabilities to navigate and manipulate the 3D models in real-time. The VOICE framework can receive arbitrary voice commands from the user and respond verbally, tightly coupled with corresponding visual representation with low latency and high accuracy. We demonstrate the effectiveness of our approach by applying it to the molecular visualization domain: analyzing three 3D molecular models with multi-scale and multi-instance attributes. We finally evaluate VOICE with the identified educational experts to show the potential of our approach. All supplemental materials are available at //osf.io/g7fbr.

In this paper, we describe VivesDebate-Speech, a corpus of spoken argumentation created to leverage audio features for argument mining tasks. The creation of this corpus represents an important contribution to the intersection of speech processing and argument mining communities, and one of the most complete publicly available resources in this topic. Moreover, we have performed a set of first-of-their-kind experiments which show an improvement when integrating audio features into the argument mining pipeline. The provided results can be used as a baseline for future research.

This paper introduces BWSNet, a model that can be trained from raw human judgements obtained through a Best-Worst scaling (BWS) experiment. It maps sound samples into an embedded space that represents the perception of a studied attribute. To this end, we propose a set of cost functions and constraints, interpreting trial-wise ordinal relations as distance comparisons in a metric learning task. We tested our proposal on data from two BWS studies investigating the perception of speech social attitudes and timbral qualities. For both datasets, our results show that the structure of the latent space is faithful to human judgements.

While both the database and high-performance computing (HPC) communities utilize lossless compression methods to minimize floating-point data size, a disconnect persists between them. Each community designs and assesses methods in a domain-specific manner, making it unclear if HPC compression techniques can benefit database applications or vice versa. With the HPC community increasingly leaning towards in-situ analysis and visualization, more floating-point data from scientific simulations are being stored in databases like Key-Value Stores and queried using in-memory retrieval paradigms. This trend underscores the urgent need for a collective study of these compression methods' strengths and limitations, not only based on their performance in compressing data from various domains but also on their runtime characteristics. Our study extensively evaluates the performance of eight CPU-based and five GPU-based compression methods developed by both communities, using 33 real-world datasets assembled in the Floating-point Compressor Benchmark (FCBench). Additionally, we utilize the roofline model to profile their runtime bottlenecks. Our goal is to offer insights into these compression methods that could assist researchers in selecting existing methods or developing new ones for integrated database and HPC applications.

Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at //mathvista.github.io/.

We propose a natural language prompt-based retrieval augmented generation (Prompt-RAG), a novel approach to enhance the performance of generative large language models (LLMs) in niche domains. Conventional RAG methods mostly require vector embeddings, yet the suitability of generic LLM-based embedding representations for specialized domains remains uncertain. To explore and exemplify this point, we compared vector embeddings from Korean Medicine (KM) and Conventional Medicine (CM) documents, finding that KM document embeddings correlated more with token overlaps and less with human-assessed document relatedness, in contrast to CM embeddings. Prompt-RAG, distinct from conventional RAG models, operates without the need for embedding vectors. Its performance was assessed through a Question-Answering (QA) chatbot application, where responses were evaluated for relevance, readability, and informativeness. The results showed that Prompt-RAG outperformed existing models, including ChatGPT and conventional vector embedding-based RAGs, in terms of relevance and informativeness. Despite challenges like content structuring and response latency, the advancements in LLMs are expected to encourage the use of Prompt-RAG, making it a promising tool for other domains in need of RAG methods.

This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at //github.com/LiheYoung/Depth-Anything.

While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.

北京阿比特科技有限公司