
This paper introduces and investigates the utilization of maximum and average distance correlations for multivariate independence testing. We characterize their consistency properties in high-dimensional settings with respect to the number of marginally dependent dimensions, assess the advantages of each test statistic, examine their respective null distributions, and present a fast chi-square-based testing procedure. The resulting tests are non-parametric and applicable to both Euclidean distance and the Gaussian kernel as the underlying metric. To better understand the practical use cases of the proposed tests, we evaluate the empirical performance of the maximum distance correlation, average distance correlation, and the original distance correlation across various multivariate dependence scenarios, and conduct a real-data experiment testing for dependence between various cancer types and peptide levels in human plasma.
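
As context, the sketch below computes the ordinary sample distance correlation (Székely et al.), the building block behind the maximum and average variants the paper studies. It is a minimal illustration, not the authors' implementation; the function name and toy data are ours.

```python
# Minimal sketch of the sample distance correlation via double-centered
# Euclidean distance matrices. Illustrative only.
import numpy as np

def dcorr(x, y):
    """Sample distance correlation between x (n, p) and y (n, q)."""
    def centered(z):
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()
    a, b = centered(x), centered(y)
    dcov2 = max((a * b).mean(), 0.0)                 # squared distance covariance
    dvar = np.sqrt((a * a).mean() * (b * b).mean())  # sqrt(dVar_x^2 * dVar_y^2)
    return np.sqrt(dcov2 / dvar) if dvar > 0 else 0.0

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x ** 2 + 0.1 * rng.normal(size=(200, 3))  # nonlinear dependence
print(dcorr(x, y))  # well above the value for independent draws
```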

Related content

This paper presents a study on strategies to enhance the translation capabilities of large language models (LLMs) in the context of machine translation (MT) tasks. The paper proposes a novel paradigm consisting of three stages: Secondary Pre-training using Extensive Monolingual Data, Continual Pre-training with Interlinear Text Format Documents, and Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs focused on various strategies for supervised fine-tuning (SFT), but their effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs' cross-lingual alignment abilities during pre-training rather than solely relying on extensive bilingual data during SFT. Experiments conducted on the Llama2 model, particularly on Chinese-Llama2 after monolingual augmentation, demonstrate the improved translation capabilities of LLMs. A significant contribution of our approach lies in Stage 2: Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B tokens of training data, making our method highly efficient. Additionally, in Stage 3, we observed that setting instructions consistent with the source language benefits the supervised fine-tuning process. Experimental results demonstrate that our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.
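
To make Stage 2 concrete, the sketch below shows one plausible rendering of an "interlinear text format" document: sentence-aligned source and target lines interleaved so the model sees cross-lingual alignment as ordinary pre-training text. The exact template the authors use is not specified here; this construction and the example sentences are assumptions.

```python
# Hypothetical interlinear-format document builder for continual
# pre-training (Stage 2). Interleaving aligned source/target lines is one
# plausible reading of "interlinear text format"; the real template may differ.
def make_interlinear(pairs):
    lines = []
    for src, tgt in pairs:  # sentence-aligned bilingual pairs
        lines.append(src)
        lines.append(tgt)
    return "\n".join(lines)

doc = make_interlinear([
    ("今天天气很好。", "The weather is nice today."),
    ("我们去公园散步吧。", "Let's take a walk in the park."),
])
print(doc)  # fed to the LM as ordinary pre-training text
```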

This paper presents an ecosystem for personal knowledge graphs (PKGs), commonly defined as resources of structured information about entities related to an individual, their attributes, and the relations between them. PKGs are a key enabler of secure and sophisticated personal data management and personalized services. However, there are challenges that need to be addressed before PKGs can achieve widespread adoption. One of the fundamental challenges is the very definition of what constitutes a PKG, as there are multiple interpretations of the term. We propose our own definition of a PKG, emphasizing the aspects of (1) data ownership by a single individual and (2) the delivery of personalized services as the primary purpose. We further argue that a holistic view of PKGs is needed to unlock their full potential, and propose a unified framework for PKGs, where the PKG is a part of a larger ecosystem with clear interfaces towards data services and data sources. A comprehensive survey and synthesis of existing work is conducted, with a mapping of the surveyed work into the proposed unified ecosystem. Finally, we identify open challenges and research opportunities for the ecosystem as a whole, as well as for the specific aspects of PKGs, which include population, representation and management, and utilization.
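
To ground the definition, the toy below represents a PKG as subject-predicate-object triples owned by a single individual, with a minimal query interface of the kind a personalized data service might call. All entity and relation names are hypothetical, not drawn from the paper or any surveyed system.

```python
# Hypothetical minimal PKG: triples about one individual, plus a query
# helper standing in for the ecosystem's interface toward data services.
pkg = [
    ("me", "hasName", "Alice"),
    ("me", "worksFor", "ACME Corp"),
    ("me", "knows", "Bob"),
    ("Bob", "hasEmail", "bob@example.org"),
]

def attributes_of(entity, graph):
    """Return (predicate, object) pairs describing one entity."""
    return [(p, o) for s, p, o in graph if s == entity]

print(attributes_of("me", pkg))
```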

This paper presents a novel framework for continual feature selection (CFS) in data preprocessing, particularly in open and dynamic environments where unknown classes may emerge. CFS faces two primary challenges: the discovery of unknown knowledge and the transfer of known knowledge. To address them, the proposed CFS method combines the strengths of continual learning (CL) with granular-ball computing (GBC), constructing a granular-ball knowledge base to detect unknown classes and to transfer previously learned knowledge for further feature selection. CFS consists of two stages: initial learning and open learning. The former establishes an initial knowledge base through multi-granularity representation using granular-balls. The latter utilizes prior granular-ball knowledge to identify unknowns, updates the knowledge base for granular-ball knowledge transfer, reinforces old knowledge, and integrates new knowledge. Subsequently, we devise an optimal-feature-subset mechanism that incorporates a minimal set of new features into the existing optimal subset, often yielding superior results in each period. Extensive experimental results on public benchmark datasets demonstrate our method's superiority in terms of both effectiveness and efficiency compared to state-of-the-art feature selection methods.
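
For readers unfamiliar with GBC, the sketch below shows the standard granular-ball construction that such a knowledge base builds on: recursively split the data with 2-means until each ball is sufficiently pure. This is a simplified illustration of the generic technique, not the authors' full CFS pipeline; thresholds and the ball summary are our choices.

```python
# Granular-ball generation by recursive 2-means splitting until each ball
# reaches a purity threshold -- the generic GBC construction, simplified.
import numpy as np
from sklearn.cluster import KMeans

def purity(y):
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

def granular_balls(X, y, threshold=0.9, min_size=4):
    if purity(y) >= threshold or len(y) <= min_size:
        return [(X.mean(axis=0), len(y), purity(y))]  # (center, size, purity)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    if labels.min() == labels.max():                   # split failed; stop here
        return [(X.mean(axis=0), len(y), purity(y))]
    balls = []
    for k in (0, 1):
        balls += granular_balls(X[labels == k], y[labels == k], threshold, min_size)
    return balls

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(len(granular_balls(X, y)))  # a handful of mostly-pure balls
```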

This paper investigates a novel hybrid worker recruitment problem in which the mobile crowd sensing and computing (MCSC) platform employs workers to serve MCSC tasks with diverse quality requirements and budget constraints, under uncertainties in workers' participation and their local workloads. We propose a hybrid worker recruitment framework consisting of offline and online trading modes. The former enables the platform to overbook long-term workers (services) to cope with dynamic service supply by signing contracts in advance, and is formulated as a 0-1 integer linear program (ILP) with probabilistic constraints on service quality and budget. Besides, since the existing uncertainties may leave long-term workers unable to meet the service quality requirement of each task, we augment our methodology with an online temporary worker recruitment scheme as a backup plan to support seamless service provisioning for MCSC tasks, which likewise constitutes a 0-1 ILP. To tackle these problems, which are proven to be NP-hard, we develop three algorithms, namely (i) exhaustive search, (ii) unique-index-based stochastic search with a risk-aware filter constraint, and (iii) a geometric-programming-based successive convex algorithm, which achieve optimal or sub-optimal solutions. Experimental results demonstrate the effectiveness of our approach in terms of service quality, time efficiency, etc.
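
To illustrate the shape of the offline problem, the toy below solves a stripped-down 0-1 selection by exhaustive search (algorithm (i)): choose a subset of long-term workers maximizing total expected quality under a budget. The paper's probabilistic quality constraints are collapsed to deterministic expected values here, and all numbers are invented.

```python
# Toy 0-1 selection solved by brute force: maximize quality within budget.
# A deterministic stand-in for the paper's probabilistically constrained ILP.
from itertools import combinations

costs   = [4, 3, 5, 2]   # contract cost per candidate worker
quality = [7, 5, 9, 3]   # expected service quality per worker
budget  = 8

best_q, best_set = 0, ()
for r in range(len(costs) + 1):
    for subset in combinations(range(len(costs)), r):
        c = sum(costs[i] for i in subset)
        q = sum(quality[i] for i in subset)
        if c <= budget and q > best_q:
            best_q, best_set = q, subset

print(best_set, best_q)  # -> (1, 2) 14: workers 1 and 2, cost 8, quality 14
```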

The pursuit of higher data rates and efficient spectrum utilization in modern communication technologies necessitates novel solutions. In order to provide insights into improving spectral efficiency and reducing latency, this study investigates the maximum channel coding rate (MCCR) of finite block length (FBL) multiple-input multiple-output (MIMO) faster-than-Nyquist (FTN) channels. By optimizing power allocation, we derive the system's MCCR expression. Simulation results are compared with the existing literature to reveal the benefits of FTN in FBL transmission.
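
The paper's MCCR expression for MIMO FTN channels is not reproduced in this summary. As a hedged illustration of the underlying tradeoff, the sketch below evaluates the classical finite-blocklength normal approximation of Polyanskiy, Poor, and Verdú for a plain real AWGN channel, showing how achievable rate approaches capacity as the blocklength grows; parameter values are illustrative.

```python
# Classical FBL normal approximation R(n, eps) ~ C - sqrt(V/n) * Q^{-1}(eps)
# for a real AWGN channel -- not the paper's MIMO-FTN expression, but it
# shows the rate/blocklength/error tradeoff that MCCR analysis refines.
import numpy as np
from scipy.stats import norm

def fbl_rate(snr, n, eps):
    C = 0.5 * np.log2(1 + snr)                                        # capacity, bits/use
    V = (snr / 2) * (snr + 2) / (snr + 1) ** 2 * np.log2(np.e) ** 2   # channel dispersion
    return C - np.sqrt(V / n) * norm.isf(eps)                         # norm.isf = Q^{-1}

for n in (100, 500, 2000):
    print(n, round(fbl_rate(snr=10.0, n=n, eps=1e-3), 3))  # rate rises toward C
```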

In this paper we propose a language for conveniently defining a wide range of execution strategies for real-time rewrite theories, and provide Maude-strategy-implemented versions of most Real-Time Maude analysis methods, albeit with user-defined discrete and timed strategies. We also identify a new time sampling strategy that should provide both efficient and exhaustive analysis for many distributed real-time systems. We exemplify the use of our language and its analyses on a simple round trip time protocol, and compare the performance of standard Maude search with our strategy-implemented reachability analyses on the CASH scheduling algorithm benchmark.
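
The strategies themselves are written in Maude's strategy language, which we do not attempt to reproduce here. As a rough, language-agnostic illustration of why the choice of time sampling strategy matters, the Python toy below contrasts fixed-increment ticking with a "jump to the next deadline" style of sampling that visits far fewer states while still covering every instant at which a discrete rule can fire; the event times are invented.

```python
# Toy contrast (in Python, not Maude) between fixed-increment time sampling
# and sampling only at the instants where discrete transitions can fire.
deadlines = [3, 7, 7, 12]  # hypothetical firing times of discrete rules

def fixed_increment(step, horizon=12):
    return list(range(0, horizon + 1, step))

def maximal_time_elapse():
    return sorted(set([0] + deadlines))

print(len(fixed_increment(1)), "sampled states vs", len(maximal_time_elapse()))
```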

This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. While the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7-13B models) as universal DP task solvers. We select a collection of datasets across four representative DP tasks and construct instruction-tuning data using serialization and knowledge-injection techniques tailored to DP. As such, the instruction-tuned LLMs empower users to manually craft instructions for DP. Meanwhile, they can operate on a local, single, low-priced GPU, ensuring data security and enabling further tuning. Our experiments show that our dataset constructed for DP instruction tuning, namely Jellyfish, effectively enhances LLMs' DP performance and barely compromises their abilities on NLP tasks. By tuning Mistral-7B and OpenOrca-Platypus2-13B with Jellyfish, the models deliver performance competitive with state-of-the-art DP methods and strong generalizability to unseen tasks. The models' performance rivals that of GPT-series models, and their interpretations exhibit enhanced reasoning capabilities compared to GPT-3.5. The 7B and 13B Jellyfish models are available at Hugging Face: //huggingface.co/NECOUDBFM/Jellyfish-7B and //huggingface.co/NECOUDBFM/Jellyfish-13B
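
To make the serialization idea concrete, the sketch below turns one table row into an instruction-tuning example for a representative DP task (error detection). The actual Jellyfish prompt template is not reproduced; the wording, field names, and the injected error are illustrative assumptions.

```python
# Hypothetical serialization of a DP instance (error detection) into an
# instruction-tuning example; not the real Jellyfish template.
import json

def serialize_row(row):
    return ", ".join(f"{k}: {v}" for k, v in row.items())

row = {"name": "Apple iPhone 12", "price": "-799", "brand": "Apple"}
example = {
    "instruction": "You are given a product record. Decide whether the "
                   "attribute 'price' contains an error. Answer Yes or No.",
    "input": serialize_row(row),
    "output": "Yes",  # the negative price is an injected error
}
print(json.dumps(example, indent=2, ensure_ascii=False))
```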

This paper introduces a novel method leveraging bi-encoder-based detectors, along with a comprehensive study comparing out-of-distribution (OOD) detection methods in NLP across different feature extractors. The feature extraction stage employs popular methods such as the Universal Sentence Encoder (USE), BERT, MPNET, and GloVe to extract informative representations from textual data. The evaluation is conducted on several datasets, including CLINC150, ROSTD-Coarse, SNIPS, and YELLOW. Performance is assessed using metrics such as F1-Score, MCC, FPR@90, FPR@95, AUPR, and AUROC. The experimental results demonstrate that the proposed bi-encoder-based detectors outperform other methods, both those that require OOD labels in training and those that do not, across all datasets, showing great potential for OOD detection in NLP. The simplicity of the training process and the superior detection performance make them applicable to real-world scenarios. The presented methods and benchmarking metrics serve as a valuable resource for future research in OOD detection, enabling further advancements in this field. The code and implementation details can be found on our GitHub repository: //github.com/yellowmessenger/ood-detection.
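
As background, the sketch below shows one common bi-encoder-style OOD score: embed the in-domain training utterances, then flag a test utterance whose maximum cosine similarity to them falls below a threshold. The paper's exact detector, training procedure, and thresholds may differ; the model name and cutoff here are our assumptions.

```python
# Generic bi-encoder OOD scoring via max cosine similarity to in-domain
# embeddings; an illustration, not the paper's trained detector.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # assumed MPNET-based encoder
in_domain = ["book a flight to Paris", "what is my account balance"]
queries   = ["reserve a plane ticket", "recite a poem about autumn"]

E = model.encode(in_domain, normalize_embeddings=True)
Q = model.encode(queries, normalize_embeddings=True)
scores = (Q @ E.T).max(axis=1)  # max cosine similarity per query

for q, s in zip(queries, scores):
    print(f"{s:.2f}  {'ID' if s > 0.5 else 'OOD'}  {q}")  # 0.5 cutoff is illustrative
```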

This paper investigates the impact of corpus creation decisions on large multi-lingual geographic web corpora. Beginning with a 427-billion-word corpus derived from the Common Crawl, three methods are used to improve the quality of sub-corpora representing specific language-country pairs like New Zealand English: (i) the agreement of independent language identification systems, (ii) hash-based deduplication, and (iii) location-specific outlier detection. The impact of each of these steps is then evaluated at the language level and the country level by using corpus similarity measures to compare each resulting corpus with baseline data sets. The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations. The evaluation shows that the validity of sub-corpora is improved with each stage of cleaning but that this improvement is unevenly distributed across languages and populations. This result shows how standard corpus creation techniques can accidentally exclude under-represented populations.
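
Of the three steps, hash-based deduplication (ii) is the most mechanical; the sketch below shows its generic form: keep a page only if the hash of its normalized text has not been seen before. The normalization choices (lowercasing, whitespace collapsing) are ours and not necessarily the paper's.

```python
# Generic hash-based deduplication over web pages; an illustration of
# step (ii), with normalization choices that are assumptions.
import hashlib

def normalized_hash(text):
    canon = " ".join(text.lower().split())  # collapse case and whitespace
    return hashlib.md5(canon.encode("utf-8")).hexdigest()

seen, kept = set(), []
for page in ["Kia ora!  Welcome.", "kia ora! welcome.", "Unique page."]:
    h = normalized_hash(page)
    if h not in seen:
        seen.add(h)
        kept.append(page)

print(kept)  # the near-duplicate second page is dropped
```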

This paper introduces several enhancements to the minimum covariance determinant (MCD) method of outlier detection and robust estimation of means and covariances. We leverage the principal component transform to achieve dimension reduction and ultimately better analyses. Our best subset selection algorithm strategically combines statistical depth and concentration steps. To ascertain the appropriate subset size and number of principal components, we introduce a bootstrap procedure that estimates the instability of the best subset algorithm. The parameter combination exhibiting minimal instability proves ideal for the purposes of outlier detection and robust estimation. Rigorous benchmarking against prominent MCD variants showcases our approach's superior statistical performance and computational speed in high dimensions. Application to a fruit spectra data set and a cancer genomics data set illustrates our claims.
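
The overall recipe (reduce dimension, then fit MCD and flag outliers by robust Mahalanobis distance) can be sketched with stock scikit-learn components, as below. This uses the standard MinCovDet estimator, not the authors' depth-based subset search or bootstrap instability criterion; the synthetic data and parameter choices are ours.

```python
# PCA dimension reduction followed by a stock MCD fit and robust
# Mahalanobis-distance outlier flagging; a baseline sketch only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
X[:10] += 6.0  # ten planted outliers

Z = PCA(n_components=5).fit_transform(X)  # reduce dimension first
mcd = MinCovDet(random_state=0).fit(Z)    # robust location/scatter on scores
d2 = mcd.mahalanobis(Z)                   # squared robust distances
print(np.argsort(d2)[-10:])               # the planted outliers rank highest
```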
