Accents play a pivotal role in shaping human communication, enhancing our ability to convey and comprehend messages with clarity and cultural nuance. While there has been significant progress in Automatic Speech Recognition (ASR), African-accented English ASR has been understudied due to a lack of training datasets, which are often expensive to create and demand colossal human labor. Combining several active learning paradigms with the core-set approach, we propose a new multi-round adaptation process that uses epistemic uncertainty to automate the annotation process, significantly reducing the associated costs and human labor. This novel method streamlines data annotation and strategically selects the data samples that contribute most to model uncertainty, thereby enhancing training efficiency. We define a new metric, U-WER, to track model adaptation to hard accents. We evaluate our approach across several domains, datasets, and high-performing speech models. Our results show that our approach leads to a 69.44% WER improvement while requiring, on average, 45% less data than established baselines. Our approach also improves out-of-distribution generalization for very low-resource accents, demonstrating its viability for building generalizable ASR models in the context of accented African ASR. We open-source the code here: //github.com/bonaventuredossou/active_learning_african_asr
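
The abstract does not spell out the selection procedure or the exact U-WER definition, so the sketch below is only a minimal illustration of how epistemic-uncertainty-driven sample selection and an accent-conditioned WER could look. All names here (`predictive_entropy`, `select_most_uncertain`, `u_wer`, `model.predict_proba`) are our own assumptions, not the authors' implementation.

```python
# Hedged sketch: uncertainty-driven sample selection and an accent-restricted WER.
import numpy as np
from jiwer import wer  # a common WER library; any WER implementation works here

def predictive_entropy(token_probs: np.ndarray) -> float:
    """Mean per-token entropy of the model's output distribution (uncertainty proxy)."""
    eps = 1e-12
    return float(-(token_probs * np.log(token_probs + eps)).sum(axis=-1).mean())

def select_most_uncertain(unlabeled, model, budget):
    """Pick the `budget` utterances the current model is least certain about.
    `model.predict_proba` is a hypothetical hook returning per-token probabilities."""
    scored = [(predictive_entropy(model.predict_proba(x)), x) for x in unlabeled]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [x for _, x in scored[:budget]]  # send these for human annotation

def u_wer(refs, hyps, accents, hard_accents):
    """WER restricted to utterances from accents the model currently finds hard."""
    pairs = [(r, h) for r, h, a in zip(refs, hyps, accents) if a in hard_accents]
    if not pairs:
        return 0.0
    return wer([r for r, _ in pairs], [h for _, h in pairs])
```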

Related Content

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methods and technologies enabling computers to recognize spoken language and translate it into text. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT). It integrates knowledge and research from the fields of computer science, linguistics, and computer engineering.

4D head capture aims to generate dynamic topological meshes and corresponding texture maps from videos, and is widely used in movies and games for its ability to simulate facial muscle movements and recover dynamic textures during pore-squeezing. The industry often adopts a pipeline of multi-view stereo and non-rigid alignment. However, this approach is prone to errors and relies heavily on time-consuming manual processing by artists. To simplify this process, we propose Topo4D, a novel framework for automatic geometry and texture generation that optimizes densely aligned 4D heads and 8K texture maps directly from calibrated multi-view time-series images. Specifically, we first represent the time-series faces as a set of dynamic 3D Gaussians with fixed topology, in which the Gaussian centers are bound to the mesh vertices. Afterward, we perform alternating geometry and texture optimization frame-by-frame for high-quality geometry and texture learning while maintaining temporal topology stability. Finally, we can extract dynamic facial meshes with a regular wiring arrangement and high-fidelity textures with pore-level details from the learned Gaussians. Extensive experiments show that our method outperforms current SOTA face reconstruction methods in both mesh and texture quality. Project page: //xuanchenli.github.io/Topo4D/.
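
The following is a minimal sketch, under our own assumptions, of the two ideas named in the abstract: Gaussian centers shared with the vertices of a fixed-topology mesh, and an alternating geometry/texture optimization loop run frame-by-frame. `render_fn` stands in for a differentiable Gaussian renderer, and the class and hyperparameters are illustrative, not the released Topo4D code.

```python
# Hedged sketch: Gaussians bound to a fixed-topology mesh, optimized per frame.
import torch

class TopologyBoundGaussians(torch.nn.Module):
    def __init__(self, template_vertices: torch.Tensor, faces: torch.Tensor):
        super().__init__()
        # Fixed topology: faces never change; only vertex positions are optimized.
        self.register_buffer("faces", faces)
        self.vertices = torch.nn.Parameter(template_vertices.clone())  # Gaussian centers
        self.colors = torch.nn.Parameter(torch.rand(len(template_vertices), 3))
        self.scales = torch.nn.Parameter(torch.full((len(template_vertices), 3), 0.01))

def alternate_optimization(model, render_fn, frames, steps_per_frame=200):
    """Alternate geometry and texture updates for each time-series frame."""
    geo_opt = torch.optim.Adam([model.vertices, model.scales], lr=1e-3)
    tex_opt = torch.optim.Adam([model.colors], lr=1e-2)
    for target in frames:                                 # frame-by-frame, warm-started
        for step in range(steps_per_frame):
            opt = geo_opt if step % 2 == 0 else tex_opt   # alternate geometry / texture
            opt.zero_grad()
            loss = torch.nn.functional.l1_loss(render_fn(model), target)
            loss.backward()
            opt.step()
```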

Video Temporal Grounding (VTG) focuses on accurately identifying event timestamps within a particular video based on a linguistic query, playing a vital role in downstream tasks such as video browsing and editing. While Video Large Language Models (video LLMs) have made significant progress in understanding video content, they often face challenges in accurately pinpointing timestamps within videos, which limits their performance on VTG tasks. Therefore, to improve video LLMs' ability to effectively locate timestamps, we argue that two critical aspects need to be enhanced. First, it is essential to have high-quality instructional tuning datasets that encompass mainstream VTG tasks. Second, directly incorporating timestamp knowledge into video LLMs is crucial, as it enables models to efficiently comprehend timestamp information. To address these needs, we first introduce VTG-IT-120K, a high-quality and comprehensive instruction tuning dataset that covers VTG tasks such as moment retrieval, dense video captioning, video summarization, and video highlight detection. Furthermore, we propose a specially designed video LLM model for VTG tasks, VTG-LLM, which (1) effectively integrates timestamp knowledge into visual tokens; (2) incorporates absolute-time tokens that specifically handle timestamp knowledge, thereby avoiding concept shifts; and (3) introduces a lightweight, high-performance slot-based token compression method to facilitate the sampling of more video frames. Comprehensive experiments showcase the superior performance of VTG-LLM in comparison to other video LLM methods across various VTG tasks. Our code and datasets are available at \url{//github.com/gyxxyg/VTG-LLM}.
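
As a rough illustration of the slot-based token compression mentioned above, the sketch below uses a small set of learnable slot queries that cross-attend to a long sequence of frame tokens, producing a short fixed-length sequence for the LLM. The layer sizes, slot count, and class name are our assumptions, not the VTG-LLM implementation.

```python
# Hedged sketch of slot-based visual-token compression for a video LLM.
import torch

class SlotTokenCompressor(torch.nn.Module):
    def __init__(self, dim=768, num_slots=32, num_heads=8):
        super().__init__()
        self.slots = torch.nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames * patches, dim) -> (batch, num_slots, dim)
        queries = self.slots.unsqueeze(0).expand(frame_tokens.size(0), -1, -1)
        compressed, _ = self.attn(queries, frame_tokens, frame_tokens)
        return compressed

# Usage: compress 4096 visual tokens down to 32 before feeding them to the LLM.
tokens = torch.randn(2, 4096, 768)
print(SlotTokenCompressor()(tokens).shape)  # torch.Size([2, 32, 768])
```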

Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts, but text prompts remain underexplored. In this paper, we empirically investigate which text prompt encoders (e.g., CLIP or an LLM) are good for adapting SAM to referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method that exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 obtains state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrates the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing parameters by nearly 82% compared to previous SAM methods based on large multimodal models.
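
To make the overall flow concrete, here is a minimal sketch, under our own assumptions: an early-fusion vision-language encoder maps the image and the referring expression to a joint embedding, which is projected into the mask decoder's prompt space. The sub-modules are caller-supplied placeholders; the real SAM decoder takes additional arguments (positional encodings, dense prompts, multimask flags), and the dummy usage below only demonstrates shapes.

```python
# Hedged sketch of an early-fusion referring-segmentation pipeline in the EVF-SAM spirit.
import torch

class ReferringSegmenter(torch.nn.Module):
    def __init__(self, vlm_encoder, image_encoder, mask_decoder, vlm_dim=1024, prompt_dim=256):
        super().__init__()
        # Placeholders: a BEIT-3-style early-fusion encoder and SAM-like encoder/decoder.
        self.vlm_encoder = vlm_encoder
        self.image_encoder = image_encoder
        self.mask_decoder = mask_decoder
        self.project = torch.nn.Linear(vlm_dim, prompt_dim)

    def forward(self, image, text_tokens):
        fused = self.vlm_encoder(image, text_tokens)   # joint image-text embedding (B, vlm_dim)
        prompt = self.project(fused)                   # map into the decoder's prompt space
        return self.mask_decoder(self.image_encoder(image), prompt)

# Toy usage with stand-in modules (shapes only; not the real BEIT-3 / SAM APIs).
vlm = lambda img, txt: torch.randn(img.size(0), 1024)
enc = lambda img: torch.randn(img.size(0), 256, 64, 64)
dec = lambda feats, prompt: torch.sigmoid(feats.mean(dim=1, keepdim=True))
seg = ReferringSegmenter(vlm, enc, dec)
print(seg(torch.randn(1, 3, 1024, 1024), None).shape)  # torch.Size([1, 1, 64, 64])
```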

Social media plays a central role in shaping public opinion and behavior, yet performing experiments on these platforms and, in particular, on feed algorithms is becoming increasingly challenging. This article offers practical recommendations to researchers developing and deploying field experiments focused on real-time re-ranking of social media feeds. This article is organized around two contributions. First, we overview an experimental method using web browser extensions that intercepts and re-ranks content in real-time, enabling naturalistic re-ranking field experiments. We then describe feed interventions and measurements that this paradigm enables on participants' actual feeds, without requiring the involvement of social media platforms. Second, we offer concrete technical recommendations for intercepting and re-ranking social media feeds with minimal user-facing delay, and provide an open-source implementation. This document aims to summarize lessons learned, provide concrete implementation details, and foster the ecosystem of independent social media research.
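
As a small illustration of the re-ranking step described above, the sketch below shows a pure Python re-ranker that a browser extension could call with intercepted feed items; the data fields and the keyword-based scoring rule are illustrative assumptions, not the article's open-source implementation.

```python
# Hedged sketch of re-ranking intercepted feed items by an experimenter-defined score.
from dataclasses import dataclass

@dataclass
class FeedItem:
    post_id: str
    text: str
    original_rank: int

def rerank(items: list[FeedItem], score_fn) -> list[FeedItem]:
    """Re-order intercepted items by score, highest first.
    Ties fall back to the platform's original ordering to minimize disruption."""
    return sorted(items, key=lambda it: (-score_fn(it), it.original_rank))

# Example: demote posts containing a flagged keyword (a stand-in for a real classifier).
items = [FeedItem("a", "breaking outrage!!!", 0), FeedItem("b", "a calm update", 1)]
reranked = rerank(items, lambda it: 0.0 if "outrage" in it.text else 1.0)
print([it.post_id for it in reranked])  # ['b', 'a']
```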

Large language models (LLMs) have achieved remarkable success but still tend to generate factually erroneous responses, a phenomenon known as hallucination. A recent trend is to use preference learning to fine-tune models to align with factuality. However, existing work primarily evaluates fine-tuned models on in-domain (ID) datasets, while factuality on out-of-domain (OOD) datasets remains underexplored. In this paper, we conduct a comprehensive evaluation of the factuality of models tuned by various preference learning algorithms and demonstrate that their performance on OOD datasets either increases minimally or decreases. Subsequently, by analyzing the token distribution shift of the models before and after tuning, we reveal that the main cause of a model's failure to uphold factuality under a distribution shift is under-alignment, rather than over-alignment. Finally, we propose APEFT (Atomic Preference Enhanced Factuality Tuning), a framework that enhances a model's awareness of factuality at the granularity of individual facts. Extensive experiments demonstrate that APEFT improves model performance by an average of 3.45% on both ID and OOD datasets, showing that it is highly effective.
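
The diagnosis above rests on comparing token distributions before and after preference tuning. The sketch below shows one minimal way to measure such a shift: per-position KL divergence between a base and a tuned checkpoint on the same prompt. It assumes the tuned model is a fine-tune of the base sharing its tokenizer; the function name and model arguments are ours, not the paper's code.

```python
# Hedged sketch of a token-distribution-shift measurement between two checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_shift(base_name: str, tuned_name: str, prompt: str) -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(base_name)  # assumed shared by both checkpoints
    base = AutoModelForCausalLM.from_pretrained(base_name).eval()
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name).eval()
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        p = torch.log_softmax(base(ids).logits, dim=-1)
        q = torch.log_softmax(tuned(ids).logits, dim=-1)
    # Per-position KL(tuned || base); values near zero mean tuning barely changed the
    # model's predictions there, which is the symptom of under-alignment.
    kl = (q.exp() * (q - p)).sum(dim=-1)
    return kl.squeeze(0)
```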

Laparoscopic liver surgery presents a complex intraoperative dynamic environment for surgeons, in which it remains a significant challenge to distinguish critical or even hidden structures inside the liver. Liver anatomical landmarks, e.g., the ridge and ligament, serve as important markers for 2D-3D alignment and can significantly enhance surgeons' spatial perception for precise surgery. To facilitate the detection of laparoscopic liver landmarks, we collect a novel dataset called L3D, which comprises 1,152 frames with elaborate landmark annotations from surgical videos of 39 patients across two medical sites. For benchmarking purposes, 12 mainstream detection methods are selected and comprehensively evaluated on L3D. Further, we propose a depth-driven geometric prompt learning network, namely D2GPLand. Specifically, we design a Depth-aware Prompt Embedding (DPE) module that is guided by self-supervised prompts and generates semantically relevant geometric information with the benefit of global depth cues extracted from SAM-based features. Additionally, a Semantic-specific Geometric Augmentation (SGA) scheme is introduced to efficiently merge RGB-D spatial and geometric information through reverse anatomic perception. The experimental results indicate that D2GPLand obtains state-of-the-art performance on L3D, with 63.52% DICE and 48.68% IoU scores. Together with 2D-3D fusion technology, our method can directly provide surgeons with intuitive guidance information in laparoscopic scenarios.
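
The general idea of depth-driven prompting can be illustrated very roughly: embed depth cues, fuse them with RGB image features, and predict per-pixel landmark logits. The module shapes, class names, and three-class output below are our assumptions for illustration only, not the D2GPLand architecture.

```python
# Hedged sketch of depth-guided feature fusion for landmark segmentation.
import torch

class DepthPromptFusion(torch.nn.Module):
    def __init__(self, feat_dim=256, num_classes=3):  # e.g. ridge, ligament, background
        super().__init__()
        self.depth_embed = torch.nn.Conv2d(1, feat_dim, kernel_size=3, padding=1)
        self.fuse = torch.nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)
        self.head = torch.nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, rgb_feats: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb_feats: (B, feat_dim, H, W) image features; depth: (B, 1, H, W) depth map
        prompt = self.depth_embed(depth)                         # depth-aware prompt features
        fused = self.fuse(torch.cat([rgb_feats, prompt], dim=1))
        return self.head(fused)                                  # per-pixel landmark logits

logits = DepthPromptFusion()(torch.randn(1, 256, 64, 64), torch.randn(1, 1, 64, 64))
print(logits.shape)  # torch.Size([1, 3, 64, 64])
```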

Given comparative text, comparative relation extraction aims to extract the two targets being compared (e.g., two cameras) and the aspect on which they are compared (e.g., image quality). The extracted comparative relations form the basis of further opinion analysis. Existing solutions formulate this task as a sequence labeling task to extract targets and aspects; however, they cannot directly extract comparative relation(s) from text. In this paper, we show that comparative relations can be directly extracted with high accuracy by a generative model. Based on GPT-2, we propose a Generation-based Comparative Relation Extractor (GCRE-GPT). Experimental results show that GCRE-GPT achieves state-of-the-art accuracy on two datasets.
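
To show what a generation-based formulation looks like in practice, here is a hedged sketch of casting relation extraction as conditional text generation with GPT-2. The linearization format ("target1 | target2 | aspect") and the prompt template are our own assumptions used only to illustrate the idea; meaningful outputs would require fine-tuning on such pairs.

```python
# Hedged sketch: comparative relation extraction framed as text generation with GPT-2.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

sentence = "Camera A produces sharper photos than Camera B."
prompt = f"Sentence: {sentence}\nRelation:"
# During fine-tuning, the supervision target would be a linearized relation such as:
target = " Camera A | Camera B | image quality"

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:]))  # meaningful only after fine-tuning
```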

Large Language Models (LLMs) have emerged as a new paradigm for embodied reasoning and control, most recently by generating robot policy code that utilizes a custom library of vision and control primitive skills. However, prior approaches fix their skill library and steer the LLM with carefully hand-crafted prompt engineering, limiting the agent to a stationary range of addressable tasks. In this work, we introduce LRLL, an LLM-based lifelong learning agent that continuously grows the robot skill library to tackle manipulation tasks of ever-growing complexity. LRLL achieves this with four novel contributions: 1) a soft memory module that allows dynamic storage and retrieval of past experiences to serve as context, 2) a self-guided exploration policy that proposes new tasks in simulation, 3) a skill abstractor that distills recent experiences into new library skills, and 4) a lifelong learning algorithm that enables human users to bootstrap new skills with minimal online interaction. LRLL continuously transfers knowledge from the memory to the library, building composable, general, and interpretable policies while bypassing gradient-based optimization, thus relieving the learner from catastrophic forgetting. Empirical evaluation in a simulated tabletop environment shows that LRLL outperforms end-to-end and vanilla LLM approaches in the lifelong setup while learning skills that are transferable to the real world. Project material will become available at //gtziafas.github.io/LRLL_project.
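
The "soft memory" component can be pictured as a similarity-based store of past experiences used to populate the LLM's context. The sketch below is a minimal version under our own assumptions: it stores (task description, experience) pairs and retrieves the most similar ones by embedding similarity; the storage format and the use of sentence-transformers are not the LRLL implementation.

```python
# Hedged sketch of a soft memory retrieved by embedding similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

class SoftMemory:
    def __init__(self, encoder_name="all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(encoder_name)
        self.keys, self.entries = [], []   # task descriptions and their stored experiences

    def store(self, task_description: str, experience: str):
        self.keys.append(self.encoder.encode(task_description, normalize_embeddings=True))
        self.entries.append(experience)

    def retrieve(self, query: str, k: int = 3):
        """Return the k most similar past experiences to include in the LLM prompt."""
        if not self.entries:
            return []
        q = self.encoder.encode(query, normalize_embeddings=True)
        sims = np.array(self.keys) @ q
        return [self.entries[i] for i in np.argsort(-sims)[:k]]
```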

Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we comprehensively review the development of AICA over the past two decades, focusing especially on state-of-the-art methods with respect to three main challenges: the affective gap, perception subjectivity, and label noise and absence. We begin with an introduction to the key emotion representation models widely employed in AICA and a description of available datasets for evaluation, with a quantitative comparison of label noise and dataset bias. We then summarize and compare representative approaches to (1) emotion feature extraction, including both handcrafted and deep features, (2) learning methods for dominant emotion recognition, personalized emotion prediction, emotion distribution learning, and learning from noisy data or few labels, and (3) AICA-based applications. Finally, we discuss open challenges and promising research directions, such as image content and context understanding, group emotion clustering, and viewer-image interaction.

Machine learning plays a role in many deployed decision systems, often in ways that are difficult or impossible for human stakeholders to understand. Explaining, in a human-understandable way, the relationship between the input and output of machine learning models is essential to the development of trustworthy machine-learning-based systems. A burgeoning body of research seeks to define the goals and methods of explainability in machine learning. In this paper, we review and categorize research on counterfactual explanations, a specific class of explanation that describes what would have happened had the input to a model been changed in a particular way. Modern approaches to counterfactual explainability in machine learning draw connections to established legal doctrine in many countries, making them appealing for fielded systems in high-impact areas such as finance and healthcare. Thus, we design a rubric with desirable properties of counterfactual explanation algorithms and comprehensively evaluate all currently proposed algorithms against that rubric. Our rubric provides easy comparison and comprehension of the advantages and disadvantages of different approaches and serves as an introduction to major research themes in this field. We also identify gaps and discuss promising research directions in the space of counterfactual explainability.
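
For readers unfamiliar with the object being surveyed, the sketch below shows the generic optimization view of a counterfactual explanation for a differentiable classifier: search for a nearby input that flips the prediction while penalizing the distance to the original. This is a common gradient-based formulation, not any single method evaluated in the survey, and the function name and hyperparameters are our own.

```python
# Hedged sketch: gradient search for a counterfactual input x' close to x.
import torch

def counterfactual(model, x, target_class, lam=0.1, steps=500, lr=0.05):
    """Find x' near x (a batched tensor of shape (1, d)) with model(x') -> target_class.
    The loss trades off validity (cross-entropy) against proximity (L1 distance)."""
    x_cf = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_cf], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x_cf), target) \
               + lam * torch.norm(x_cf - x, p=1)   # validity + proximity/sparsity
        loss.backward()
        opt.step()
    return x_cf.detach()
```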
