
In this study, we propose a staging area for ingesting new experimental data on superconductors, machine-collected from scientific articles, into SuperCon. Our objective is to enhance the efficiency of updating SuperCon while maintaining or enhancing the data quality. We present a semi-automatic staging area driven by a workflow combining automatic and manual processes on the extracted database. An automatic anomaly detection process pre-screens the collected data. Users can then manually correct any errors through a user interface tailored to simplify data verification against the original PDF documents. Additionally, when a record is corrected, its raw data is collected and used as training data to improve the machine learning models. Evaluation experiments demonstrate that our staging area significantly improves curation quality. We compare the interface with the traditional manual approach of reading PDF documents and recording information in an Excel document. Using the interface boosts precision and recall by 6% and 50%, respectively, for an average increase of 40% in F1-score.
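For reference, the F1-score reported above is the harmonic mean of precision and recall. The sketch below shows that relationship in Python; the function names and the counts in the usage lines are illustrative assumptions, not figures from the study.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical curation result: 8 correct records, 2 wrong, 8 missed.
p, r = precision_recall(tp=8, fp=2, fn=8)
print(p, r, f1_score(p, r))
```

Because the harmonic mean is dominated by the smaller of the two values, a large gain in recall can move F1 much more than a small gain in precision.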

Related content

Processing (programming language)


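Processing is the name of an open-source programming language and its accompanying integrated development environment (IDE). It is used in the electronic arts and visual design communities to teach programming fundamentals, and has been used in a large number of new media and interactive art works.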

Networking · Reviewer · Learning · Computational cost · GNN

5 November 2023

A graph-based probabilistic geometric deep learning framework with online enforcement of physical constraints to predict the criticality of defects in porous materials

Vasilis Krokos,Stéphane P. A. Bordas,Pierre Kerfriden

from arxiv, 68 pages; 52 figures

Stress prediction in porous materials and structures is challenging due to the high computational cost associated with direct numerical simulations. Convolutional Neural Network (CNN) based architectures have recently been proposed as surrogates to approximate and extrapolate the solution of such multiscale simulations. These methodologies are usually limited to 2D problems due to the high computational cost of 3D voxel based CNNs. We propose a novel geometric learning approach based on a Graph Neural Network (GNN) that efficiently deals with three-dimensional problems by performing convolutions over 2D surfaces only. Following our previous developments using pixel-based CNN, we train the GNN to automatically add local fine-scale stress corrections to an inexpensively computed coarse stress prediction in the porous structure of interest. Our method is Bayesian and generates densities of stress fields, from which credible intervals may be extracted. As a second scientific contribution, we propose to improve the extrapolation ability of our network by deploying a strategy of online physics-based corrections. Specifically, we condition the posterior predictions of our probabilistic predictions to satisfy partial equilibrium at the microscale, at the inference stage. This is done using an Ensemble Kalman algorithm, to ensure tractability of the Bayesian conditioning operation. We show that this innovative methodology allows us to alleviate the effect of undesirable biases observed in the outputs of the uncorrected GNN, and improves the accuracy of the predictions in general.

Model selection · Random field · Estimation/estimator · Markov · Markov random field

3 November 2023

Model selection for Markov random fields on graphs under a mixing condition

Florencia Leonardi,Magno T. F. Severino

In this work, we propose a global model selection criterion to estimate the graph of conditional dependencies of a random vector based on a finite sample. By global criterion, we mean optimizing a function over the entire set of possible graphs, eliminating the need to estimate the individual neighborhoods and subsequently combine them to estimate the graph. We prove the almost sure convergence of the graph estimator. This convergence holds provided the data is a realization of a multivariate stochastic process that satisfies a mixing condition. To the best of our knowledge, these are the first results to show the consistency of a model selection criterion for Markov random fields on graphs under non-independent data.

Polysemy · Search engine · Functional · INFORMS · Precision/accuracy

3 November 2023

Enhancing search engine precision and user experience through sentiment-based polysemy resolution

Mike Nkongolo

from arxiv, This research article was accepted at the International Journal of Intelligent Systems (Hindawi), titled "News classification and categorization with smart function sentiment analysis". It underwent editing by Yaxin Bi and quality checking by Saranya Manokaran

With the proliferation of digital content and the need for efficient information retrieval, this study's insights can be applied to various domains, including news services, e-commerce, and digital marketing, to provide users with more meaningful and tailored experiences. The study addresses the common problem of polysemy in search engines, where the same keyword may have multiple meanings. It proposes a solution to this issue by embedding a smart search function into the search engine, which can differentiate between different meanings based on sentiment. The study leverages sentiment analysis, a powerful natural language processing (NLP) technique, to classify and categorize news articles based on their emotional tone. This can provide more insightful and nuanced search results. The article reports an accuracy rate of 85% for the proposed smart search function, which outperforms conventional search engines. This indicates the effectiveness of the sentiment-based approach. The research explores multiple sentiment analysis models, including Sentistrength and Valence Aware Dictionary for Sentiment Reasoning (VADER), to determine the best-performing approach. The findings can be applied to enhance search engines, making them more capable of understanding the context and intent behind users' queries. This can lead to better search results that are more aligned with what users are looking for. The proposed smart search function can improve the user experience by reducing the need to sift through irrelevant search results. This is particularly important in an age where information overload is common.

Estimation/estimator · Inference · Interpretability · Decomposed · TOOLS

2 November 2023

Inference on summaries of a model-agnostic longitudinal variable importance trajectory

Brian D. Williamson,Erica E. M. Moodie,Susan M. Shortreed

from arxiv, 65 pages (29 main, 36 supplementary), 5 figures (3 main, 2 supplementary), 19 tables (2 main, 17 supplementary)

In prediction settings where data are collected over time, it is often of interest to understand both the importance of variables for predicting the response at each time point and the importance summarized over the time series. Building on recent advances in estimation and inference for variable importance measures, we define summaries of variable importance trajectories. These measures can be estimated and the same approaches for inference can be applied regardless of the choice of the algorithm(s) used to estimate the prediction function. We propose a nonparametric efficient estimation and inference procedure as well as a null hypothesis testing procedure that are valid even when complex machine learning tools are used for prediction. Through simulations, we demonstrate that our proposed procedures have good operating characteristics, and we illustrate their use by investigating the longitudinal importance of risk factors for suicide attempt.

Estimation/estimator · Statistic · INFORMS · Analysis · Performer

2 November 2023

The Causal Roadmap and simulation studies to inform the Statistical Analysis Plan for real-data applications

Nerissa Nance,Laura Balzer

The Causal Roadmap outlines a systematic approach to our research endeavors: define the quantity of interest, evaluate the needed assumptions, conduct statistical estimation, and carefully interpret the results. At the estimation step, it is essential that the estimation algorithm be chosen thoughtfully for its theoretical properties and expected performance. Simulations can help researchers gain a better understanding of an estimator's statistical performance under conditions unique to the real-data application. This in turn can inform the rigorous pre-specification of a Statistical Analysis Plan (SAP), not only stating the estimand (e.g., G-computation formula), the estimator (e.g., targeted minimum loss-based estimation [TMLE]), and adjustment variables, but also the implementation of the estimator -- including nuisance parameter estimation and the approach for variance estimation. Doing so helps ensure valid inference (e.g., 95% confidence intervals with appropriate coverage). Failing to pre-specify estimation can lead to data dredging and inflated Type-I error rates.

WEB · MoDELS · Score · INFORMS · Language modeling

2 November 2023

ChineseWebText: Large-scale high-quality Chinese web text extracted with effective evaluation model

Jianghao Chen,Pu Jian,Tengxiao Xi,Yidong Yi,Chenglin Ding,Qianlong Du,Guibo Zhu,Chengqing Zong,Jinqiao Wang,Jiajun Zhang

During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping LLMs' capabilities. To accelerate research on LLMs, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpora focus mainly on English, and there is still no complete tool-chain for extracting clean text from web data. Furthermore, fine-grained information about the corpora, e.g. the quality of each text, is missing. To address these challenges, we propose in this paper a new complete tool-chain, EvalWeb, to extract clean Chinese text from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicitly noisy texts from the raw crawled web contents. Second, a well-designed evaluation model is leveraged to assess the remaining relatively clean data, and each text is assigned a specific quality score. Finally, an appropriate threshold can be used to select high-quality pre-training data for Chinese. Using our proposed approach, we release the largest and latest large-scale high-quality Chinese web text dataset, ChineseWebText, which consists of 1.42 TB of data in which each text is associated with a quality score, allowing LLM researchers to choose data according to their desired quality thresholds. We also release a much cleaner subset of 600 GB of Chinese data with quality exceeding 90%.
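The final step described above, selecting texts whose quality score passes a threshold, can be sketched in a few lines of Python. This is only an illustration: the `ScoredText` record, its field names, and the assumption that scores lie in [0, 1] are hypothetical and do not reflect the actual ChineseWebText file format or the EvalWeb code.

```python
from typing import Iterable, Iterator, NamedTuple


class ScoredText(NamedTuple):
    """Hypothetical record: a text plus a quality score assumed to lie in [0, 1]."""
    text: str
    quality: float


def select_high_quality(records: Iterable[ScoredText], threshold: float) -> Iterator[ScoredText]:
    """Yield only records whose quality score strictly exceeds the threshold."""
    for record in records:
        if record.quality > threshold:
            yield record


# Made-up sample data for illustration.
sample = [
    ScoredText("clean article text", 0.95),
    ScoredText("borderline text", 0.90),
    ScoredText("navigation menu residue", 0.30),
]
kept = list(select_high_quality(sample, threshold=0.9))
print([r.text for r in kept])
```

Writing the filter as a generator keeps memory use constant, which matters when the input is streamed from a corpus too large to hold in memory.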

Estimation/estimator · INFORMS · Marginalization · Layer · Extensibility

1 November 2023

Effective filtering approach for joint parameter-state estimation in SDEs via Rao-Blackwellization and modularization

Zhou Fang,Ankit Gupta,Mustafa Khammash

from arxiv, 8 pages, 2 figures

Stochastic filtering is a vibrant area of research in both control theory and statistics, with broad applications in many scientific fields. Despite its extensive historical development, an effective method for joint parameter-state estimation in SDEs is still lacking. The state-of-the-art particle filtering methods suffer from either sample degeneracy or information loss, with both issues stemming from the dynamics of the particles generated to represent system parameters. This paper provides a novel and effective approach for joint parameter-state estimation in SDEs via Rao-Blackwellization and modularization. Our method operates in two layers: the first layer estimates the system states using a bootstrap particle filter, and the second layer marginalizes out system parameters explicitly. This strategy circumvents the need to generate particles representing system parameters, thereby mitigating their associated problems of sample degeneracy and information loss. Moreover, our method employs a modularization approach when integrating out the parameters, which significantly reduces the computational complexity. Together, these design choices underpin the superior performance of our method. Finally, a numerical example is presented to illustrate that our method outperforms existing approaches by a large margin.
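To make the first layer concrete, here is a minimal bootstrap particle filter in Python. It is a generic textbook sketch, not the authors' method: it assumes a discrete-time scalar model x_t = a·x_{t-1} + process noise, y_t = x_t + observation noise, with known parameters, and it omits the Rao-Blackwellized parameter layer entirely. The function name and all default values are illustrative assumptions.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500, a=0.9,
                              process_std=0.5, obs_std=0.5, seed=0):
    """Return the filtered state mean after each observation.

    Assumed model (illustrative only):
        x_t = a * x_{t-1} + N(0, process_std^2),   y_t = x_t + N(0, obs_std^2)
    """
    rng = random.Random(seed)
    # Initial particles drawn from an assumed standard normal prior.
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    means = []
    for y in observations:
        # 1. Propagate each particle through the state dynamics.
        particles = [a * x + rng.gauss(0.0, process_std) for x in particles]
        # 2. Weight by the Gaussian observation likelihood (log-space for stability).
        log_w = [-0.5 * ((y - x) / obs_std) ** 2 for x in particles]
        max_log_w = max(log_w)
        weights = [math.exp(lw - max_log_w) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        # 3. Record the weighted posterior mean.
        means.append(sum(w * x for w, x in zip(weights, particles)))
        # 4. Multinomial resampling to counter weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means


# Usage: a constant observation sequence pulls the filtered mean towards it.
estimates = bootstrap_particle_filter([2.0] * 30)
print(estimates[-1])
```

Resampling at every step is the simplest choice; practical implementations often resample only when the effective sample size drops below a threshold.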

Unlabelled · Learning · State-of-the-art · Lecture notes · Training samples

31 October 2023

Open-set learning with augmented categories by exploiting unlabelled data

Emile R. Engelbrecht,Johan A. du Preez

Novel categories are commonly defined as those unobserved during training but present during testing. However, partially labelled training datasets can contain unlabelled training samples that belong to novel categories, meaning these can be present in both training and testing. This research is the first to generalise between what we call observed-novel and unobserved-novel categories within a new learning policy called open-set learning with augmented categories by exploiting unlabelled data, or Open-LACU. After surveying existing learning policies, we introduce Open-LACU as a unified policy of positive and unlabelled learning, semi-supervised learning and open-set recognition. Subsequently, we develop the first Open-LACU model using an algorithmic training process of the relevant research fields. The proposed Open-LACU classifier achieves state-of-the-art and first-of-its-kind results.

Contrastive · Learned · Contrastive learning · Extensibility · SSL

18 June 2020

Contrastive learning of global and local features for medical image segmentation with limited annotations

Krishna Chaitanya,Ertunc Erdil,Neerav Karani,Ender Konukoglu

from arxiv, 16 pages, 2 figures, 7 tables. This article is a pre-print and is currently under review at a conference

A key requirement for the success of supervised deep learning is a large labeled dataset - a condition that is difficult to meet in medical image analysis. Self-supervised learning (SSL) can help in this regard by providing a strategy to pre-train a neural network with unlabeled data, followed by fine-tuning for a downstream task with limited annotations. Contrastive learning, a particular variant of SSL, is a powerful technique for learning image-level representations. In this work, we propose strategies for extending the contrastive learning framework for segmentation of volumetric medical images in the semi-supervised setting with limited annotations, by leveraging domain-specific and problem-specific cues. Specifically, we propose (1) novel contrasting strategies that leverage structural similarity across volumetric medical images (domain-specific cue) and (2) a local version of the contrastive loss to learn distinctive representations of local regions that are useful for per-pixel segmentation (problem-specific cue). We carry out an extensive evaluation on three Magnetic Resonance Imaging (MRI) datasets. In the limited annotation setting, the proposed method yields substantial improvements compared to other self-supervision and semi-supervised learning techniques. When combined with a simple data augmentation technique, the proposed method reaches within 8% of benchmark performance using only two labeled MRI volumes for training, corresponding to only 4% (for ACDC) of the training data used to train the benchmark.
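As background for the contrastive loss mentioned above, the following Python sketch computes a generic InfoNCE-style loss for one anchor, one positive and a set of negatives, using cosine similarity. It is a standard formulation chosen for illustration, not the paper's global or local loss; the function names and the temperature value are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closer to the positive than to the negatives."""
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D embeddings: an aligned positive gives a small loss, a misaligned one a large loss.
print(info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
print(info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]]))
```

In a local variant, the same loss would be applied to feature vectors of image regions rather than whole-image representations; how regions are paired is specific to the method and not shown here.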

Graphics processing unit · Graph · INTERACT · Performer · Neural Networks

6 November 2019

Hyper-SAGNN: a self-attention based graph neural network for hypergraphs

Ruochi Zhang,Yuesong Zou,Jian Ma

Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.