
Visual-language pre-training has achieved remarkable success in many multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning by establishing richer image-text associations for image-text datasets. Our approach is simple: we use MLLMs to extend each image with multiple diverse captions. To prevent the bias introduced by MLLMs' hallucinations and monotonous language styles, we propose "text shearing" to maintain the quality and availability of the extended captions. In image-text retrieval, without introducing additional training cost, our method consistently improves Recall@1 by 5.6 ~ 35.0 and 16.8 ~ 46.1 points under the fine-tuning and zero-shot settings, respectively. Notably, our zero-shot results are comparable to fine-tuning on the target datasets, which encourages more exploration of the versatile use of MLLMs.
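
The abstract does not spell out the "text shearing" procedure, so the following is only a minimal sketch under the assumption that shearing trims each MLLM-extended caption back to roughly the length of the original human caption, which is one simple way to curb hallucinated tails; the function and variable names below are ours, not the paper's.

# Hypothetical "text shearing" sketch: cap each MLLM-extended caption at the
# word length of the original human-written caption so that hallucinated or
# stylistically monotonous tails are trimmed before training.
from typing import List

def text_shear(original_caption: str, extended_captions: List[str]) -> List[str]:
    max_len = len(original_caption.split())
    return [" ".join(cap.split()[:max_len]) for cap in extended_captions]

# Toy usage with one human caption and two hypothetical MLLM expansions.
original = "a dog catching a frisbee in the park"
extended = [
    "a brown dog leaps into the air to catch a red frisbee on a sunny afternoon in a grassy park",
    "an energetic dog plays fetch outdoors with its owner near some trees",
]
print(text_shear(original, extended))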

Related content

Representation learning obtains vector representations by learning from training data, which overcomes the limitations of hand-crafted approaches. It is usually divided into two broad categories: unsupervised and supervised representation learning. Most unsupervised methods use the latent variables of autoencoders (e.g., denoising autoencoders and sparse autoencoders) as representations. More recent variational autoencoders tolerate noise and outliers better, yet exactly inferring the latent structure of given data is generally intractable, and only approximate-inference strategies are available. In addition, some unsupervised representation learning methods aim to approximate a specific similarity measure. One line of work proposes an unsupervised similarity-preserving representation learning framework that uses matrix factorization to preserve pairwise DTW similarities; by learning DTW-preserving shapelets, Euclidean distances in the transformed space approximate the true DTW distances of the original data. Supervised representation learning methods can exploit label information to better capture the semantic structure of the data. Siamese networks and triplet networks are two popular models of this kind; their goal is to maximize inter-class distances while minimizing intra-class distances.
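
The triplet objective mentioned at the end of the paragraph can be made concrete with a short sketch; below is a minimal PyTorch version assuming Euclidean distances and a fixed margin, with random embeddings standing in for the output of an encoder.

# Minimal triplet-loss sketch: pull an anchor towards a same-class sample
# (positive) and push it away from a different-class sample (negative)
# by at least a margin, using Euclidean distance.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 1.0):
    d_pos = F.pairwise_distance(anchor, positive)  # intra-class distance
    d_neg = F.pairwise_distance(anchor, negative)  # inter-class distance
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random 128-d embeddings in place of an encoder's output.
a, p, n = (torch.randn(32, 128) for _ in range(3))
print(triplet_loss(a, p, n).item())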

Although the capabilities of large language models (LLMs) ideally scale up with increasing data and compute, they are inevitably constrained by limited resources in reality. Suppose we have a moderately trained LLM (e.g., one trained to align with human preference) in hand: can we further exploit its potential and cheaply acquire a stronger model? In this paper, we propose a simple method called ExPO to boost LLMs' alignment with human preference. ExPO assumes that a medium-aligned model can be viewed as an interpolation between a less-aligned (weaker) model, e.g., the initial SFT model, and a better-aligned (stronger) one, so that this stronger model can be obtained directly by extrapolating from the weights of the two weaker models. On the AlpacaEval 2.0 benchmark, we show that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to reach and even surpass the fully-trained one, without any additional training. Furthermore, ExPO also significantly improves off-the-shelf DPO/RLHF models and exhibits decent scalability across model sizes from 7B to 70B. Our work demonstrates the efficacy of model extrapolation in exploiting LLMs' capabilities, suggesting a promising direction that deserves future exploration.
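
The extrapolation step described above boils down to moving past the aligned checkpoint along the direction that leads from the weaker (e.g., SFT) weights to the aligned ones. The sketch below assumes both checkpoints share the same architecture and parameter names; the coefficient alpha and the file names are illustrative, not values from the paper.

# Weight-extrapolation sketch in the spirit of ExPO: step beyond the aligned
# checkpoint along the direction from the weaker (SFT) weights to the aligned
# (DPO/RLHF) weights. alpha = 0 recovers the aligned model; alpha > 0 extrapolates.
import torch

def extrapolate(sft_state: dict, aligned_state: dict, alpha: float = 0.3) -> dict:
    return {
        name: w_aligned + alpha * (w_aligned - sft_state[name])
        for name, w_aligned in aligned_state.items()
    }

# Hypothetical usage with two checkpoints of the same architecture.
# sft = torch.load("sft_model.pt"); dpo = torch.load("dpo_model.pt")
# stronger = extrapolate(sft, dpo, alpha=0.3)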

Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, detailed, open-sourced methodologies for efficiently scaling LLMs beyond 50 billion parameters with minimal trial-and-error and computational cost remain scarce. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by bits per byte (BPB) on textual corpora. In both English and Chinese foundation model evaluations, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.

Large language models (LLMs) have demonstrated impressive generalization capabilities on specific tasks with human-written instruction data. However, the limited quantity, diversity, and professional expertise of such instruction data raise concerns about the performance of LLMs on psychotherapy tasks when provided with domain-specific instructions. To address this, we first propose Domain-Specific Assistant Instructions based on AlexanderStreet therapy, and second, we apply an adaptation fine-tuning method and a retrieval-augmented generation method to improve pre-trained LLMs. Through quantitative evaluation of linguistic quality using automatic and human evaluation, we observe that LLMs trained on Psychotherapy Assistant Instructions outperform state-of-the-art LLM response baselines. Our Assistant-Instruction approach offers a half-annotation method to align pre-trained LLMs with instructions and provides them with more psychotherapy knowledge.

Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210 document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible - despite the massive overall size of the corpus - as duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a re-rank style approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. We also apply our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean Crawled Corpus), illustrating that a neural approach can identify many near duplicates missed by hashing, in the presence of various types of noise. The public release of our NEWS-COPY de-duplication dataset, codebase, and the pre-trained models will facilitate further research and applications.
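
For reference, the N-gram overlap baseline that the study compares against can be sketched in a few lines: shingle each document into word n-grams and flag pairs whose Jaccard similarity exceeds a threshold. The shingle size and threshold below are illustrative, not the settings used in the paper.

# Minimal N-gram-overlap de-duplication baseline: shingle each document into
# word 5-grams and flag pairs whose Jaccard similarity exceeds a threshold.
from itertools import combinations

def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def near_duplicates(docs: dict, threshold: float = 0.3) -> list:
    grams = {doc_id: shingles(text) for doc_id, text in docs.items()}
    return [
        (i, j)
        for i, j in combinations(grams, 2)
        if jaccard(grams[i], grams[j]) >= threshold
    ]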

Despite the recent progress in long-context language models, it remains elusive how transformer-based models retrieve relevant information from arbitrary locations within a long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention head is largely responsible for retrieving information, which we dub retrieval heads. We identify intriguing properties of retrieval heads: (1) universal: all explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained with short context, and when the context length is extended by continual pretraining, it is still the same set of heads that perform information retrieval; (4) dynamically activated: taking Llama-2 7B as an example, 12 retrieval heads always attend to the required information no matter how the context is changed, while the remaining retrieval heads are activated in different contexts; (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context. Conversely, tasks where the model directly generates the answer from its intrinsic knowledge are less affected by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
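
The paper's exact detection procedure is not reproduced here, but one natural way to score retrieval heads in a needle-in-a-haystack setting is to measure, at the decoding steps where the model copies a needle token, how often a head's strongest attention falls inside the needle span. The sketch below follows that assumption; the tensor layout and names are ours.

# Hypothetical retrieval-head scoring sketch: for each head, count how often its
# top-attended position lands inside a known "needle" span at generation steps
# where the model copies a needle token.
# attn has shape [num_layers, num_heads, num_generated_steps, context_len].
import torch

def retrieval_scores(attn: torch.Tensor, needle_start: int, needle_end: int,
                     copied_steps: list) -> torch.Tensor:
    """Return a [num_layers, num_heads] score in [0, 1]; higher = more retrieval-like."""
    top_pos = attn[:, :, copied_steps, :].argmax(dim=-1)    # [L, H, len(copied_steps)]
    inside = (top_pos >= needle_start) & (top_pos < needle_end)
    return inside.float().mean(dim=-1)                      # fraction of copy steps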

Integrating and processing information from various sources or modalities are critical for obtaining a comprehensive and accurate perception of the real world. Drawing inspiration from neuroscience, we develop the Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the concept of information bottleneck. Distinct from most traditional fusion models that aim to incorporate all modalities as input, our model designates the prime modality as input, while the remaining modalities act as detectors in the information pathway. Our proposed perception model focuses on constructing an effective and compact information flow by achieving a balance between the minimization of mutual information between the latent state and the input modal state, and the maximization of mutual information between the latent states and the remaining modal states. This approach leads to compact latent state representations that retain relevant information while minimizing redundancy, thereby substantially enhancing the performance of downstream tasks. Experimental evaluations on both the MUStARD and CMU-MOSI datasets demonstrate that our model consistently distills crucial information in multimodal learning scenarios, outperforming state-of-the-art benchmarks.
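
The balance described above is the classic information-bottleneck trade-off. Writing it for a prime modality X_0, a latent state B, and one remaining (detector) modality X_1 (symbols chosen here for illustration; a chain of such bottlenecks would handle additional modalities) gives:

% One link of a hierarchical information bottleneck: compress the prime modality
% X_0 into a latent state B while retaining what is predictive of the detector
% modality X_1; beta > 0 trades compression against relevance.
\min_{p(B \mid X_0)} \; \mathcal{L} \;=\; I(X_0; B) \;-\; \beta \, I(B; X_1), \qquad \beta > 0 .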

Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical images and text. Current algorithms that exploit global and local alignment between medical images and text can, however, be marred by redundant information in medical data. To address this issue, we propose a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework for chest X-ray. In this framework, medical knowledge is grounded to the appropriate anatomical regions by a transformer-based grounded knowledge-enhanced module that performs fine-grained alignment between anatomical region-level visual features and the textual features of medical knowledge. The performance of GK-MVLP is competitive with or exceeds the state of the art on downstream chest X-ray disease classification, disease localization, report generation, and medical visual question-answering tasks. Our results show the advantage of incorporating a grounding mechanism to remove biases and improve the alignment between chest X-ray images and radiology reports.

Travel mode choice (TMC) prediction, which can be formulated as a classification task, helps in understanding what makes citizens choose different modes of transport for individual trips, and is a major step towards fostering sustainable transportation. Because travel behaviour may evolve over time, we also face the problem of detecting concept drift in the data and selecting appropriate methods to address it; in particular, one must decide whether batch or stream mining methods should be used to develop periodically updated TMC models. To address this challenge, we propose the novel Incremental Ensemble of Batch and Stream Models (IEBSM) method, which adapts travel mode choice classifiers to concept drift that may occur in the data by combining drift detectors with batch learning and stream mining models. We compare it against batch and incremental learners, including methods relying on active drift detection. Experiments with varied travel mode data sets representing both city and country levels show that IEBSM both detects drift in travel mode data and successfully adapts the models to evolving travel mode choices, achieving a higher average rank than the batch and stream learners.
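
As a rough illustration of the batch-plus-stream idea (not the IEBSM algorithm itself), the sketch below pairs an incremental learner with a batch model that is retrained on a recent window whenever a crude error-rate check fires in place of a proper drift detector; all thresholds and model choices are illustrative.

# Hypothetical batch-plus-stream ensemble: an incremental SGD classifier learns
# continuously, while a batch logistic regression is retrained on the most recent
# window whenever the recent error rate exceeds a threshold (a crude stand-in
# for a drift detector). Predictions are a simple majority vote.
from collections import deque
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

class BatchStreamEnsemble:
    def __init__(self, classes, window=500, error_threshold=0.3):
        self.stream = SGDClassifier(loss="log_loss")
        self.batch = LogisticRegression(max_iter=1000)
        self.classes = classes
        self.recent = deque(maxlen=window)     # recent labelled samples
        self.errors = deque(maxlen=window)     # recent 0/1 prediction errors
        self.error_threshold = error_threshold
        self.batch_ready = False

    def predict(self, x):
        votes = [self.stream.predict([x])[0]]
        if self.batch_ready:
            votes.append(self.batch.predict([x])[0])
        return max(set(votes), key=votes.count)

    def learn(self, x, y):
        try:
            self.errors.append(int(self.predict(x) != y))
        except Exception:                      # stream model not fitted yet
            pass
        self.stream.partial_fit([x], [y], classes=self.classes)
        self.recent.append((x, y))
        if len(self.errors) == self.errors.maxlen and np.mean(self.errors) > self.error_threshold:
            X, Y = zip(*self.recent)           # retrain batch model on recent window
            self.batch.fit(list(X), list(Y))
            self.batch_ready = True
            self.errors.clear()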

While federated learning (FL) eliminates the transmission of raw data over a network, it is still vulnerable to privacy breaches from the communicated model parameters. In this work, we propose Hierarchical Federated Learning with Hierarchical Differential Privacy (H²FDP), a DP-enhanced FL methodology for jointly optimizing privacy and performance in hierarchical networks. Building upon recent proposals for Hierarchical Differential Privacy (HDP), one of the key concepts of H²FDP is adapting DP noise injection at different layers of an established FL hierarchy -- edge devices, edge servers, and cloud servers -- according to the trust models within particular subnetworks. We conduct a comprehensive analysis of the convergence behavior of H²FDP, revealing conditions on parameter tuning under which the training process converges sublinearly to a finite stationarity gap that depends on the network hierarchy, trust model, and target privacy level. Leveraging these relationships, we develop an adaptive control algorithm for H²FDP that tunes properties of local model training to minimize communication energy, latency, and the stationarity gap while striving to maintain a sub-linear convergence rate and meet desired privacy criteria. Subsequent numerical evaluations demonstrate that H²FDP obtains substantial improvements in these metrics over baselines for different privacy budgets, and validate the impact of different system configurations.
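
The trust-dependent noise injection described above can be pictured with a small Gaussian-mechanism sketch: clip an update and add noise whose multiplier depends on which tier of the hierarchy performs the aggregation. The per-tier multipliers below are purely illustrative and are not the calibrated values from the paper's analysis.

# Hypothetical trust-dependent Gaussian noise injection: clip a model update and
# add noise whose scale depends on the aggregation tier (device, edge, cloud).
import torch

NOISE_MULTIPLIER = {"device": 1.2, "edge": 0.6, "cloud": 0.3}  # illustrative only

def privatize_update(update: torch.Tensor, tier: str, clip_norm: float = 1.0) -> torch.Tensor:
    scale = torch.clamp(clip_norm / (update.norm() + 1e-12), max=1.0)  # L2 clipping
    clipped = update * scale
    sigma = NOISE_MULTIPLIER[tier] * clip_norm                         # Gaussian mechanism
    return clipped + torch.randn_like(clipped) * sigma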

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel transformer distillation method, a knowledge distillation (KD) method specially designed for transformer-based models. With this new KD method, the rich knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and task-specific knowledge of the teacher BERT. TinyBERT is empirically effective and achieves results comparable to BERT on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT is also significantly better than state-of-the-art baselines, even with only about 28% of their parameters and 31% of their inference time.
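
A minimal sketch of the transformer-distillation objective described above: mean-squared error on attention maps and on linearly projected hidden states for matched teacher/student layers, plus a temperature-softened KL term on the output logits. The layer matching, the projection proj, and the temperature are assumptions for illustration rather than the exact TinyBERT recipe.

# Transformer-distillation loss sketch in the spirit of TinyBERT: match the
# student's attention maps and projected hidden states to the teacher's, and
# distill softened output logits.
import torch
import torch.nn.functional as F

def distill_loss(s_attn, t_attn, s_hidden, t_hidden, proj, s_logits, t_logits, T=1.0):
    attn_loss = F.mse_loss(s_attn, t_attn)                 # attention transfer
    hidden_loss = F.mse_loss(proj(s_hidden), t_hidden)     # hidden-state transfer
    soft_loss = F.kl_div(                                  # prediction-layer transfer
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return attn_loss + hidden_loss + soft_loss

# proj is expected to be a linear layer mapping the student hidden size to the
# teacher hidden size, e.g. torch.nn.Linear(312, 768).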
