
This work proposes to measure the scope of a patent claim as the reciprocal of the self-information contained in the claim. A probability of occurrence of the claim is obtained from a language model, and this probability is used to compute the self-information. Grounded in information theory, this approach is based on the assumption that an unlikely concept is more informative than a usual concept, insofar as it is more surprising. In turn, the more surprising the information required to define the claim, the narrower its scope. Five language models are considered, ranging from the simplest models (each word or character is assigned an identical probability) to intermediate models (using average word or character frequencies) to a large language model (GPT2). Interestingly, the scope resulting from the simplest language models is proportional to the reciprocal of the number of words or characters involved in the claim, a metric already used in previous works. The approach is applied to multiple series of patent claims directed to distinct inventions, where each series consists of claims devised to have gradually decreasing scope. The performance of the language models is assessed with respect to several ad hoc tests. The more sophisticated the model, the better the results: the GPT2 probability model outperforms the models based on word and character frequencies, which themselves outdo the simplest models based on word or character counts. Still, the character count appears to be a more reliable indicator than the word count.
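A minimal sketch of the core computation, assuming GPT-2 is accessed via the Hugging Face transformers library (the function names and example claims here are illustrative, not the paper's code):

```python
# Claim scope as the reciprocal of self-information, with the claim's
# probability taken from GPT-2 (Hugging Face transformers).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def self_information_bits(claim: str) -> float:
    """-log2 P(claim) under GPT-2, summed over the predicted tokens."""
    ids = tokenizer(claim, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean token
        # cross-entropy in nats; scale back to a total and convert to bits.
        loss = model(ids, labels=ids).loss.item()
    return loss * (ids.shape[1] - 1) / math.log(2)

def claim_scope(claim: str) -> float:
    """Scope as the reciprocal of self-information: rarer => narrower."""
    return 1.0 / self_information_bits(claim)

# Hypothetical claims: the longer, more specific claim should score lower.
print(claim_scope("A device comprising a processor."))
print(claim_scope("A device comprising a processor configured to "
                  "compress genomic data with a learned codec."))
```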

Related content

Scattering networks yield powerful and robust hierarchical image descriptors which do not require lengthy training and which work well with very little training data. However, they rely on sampling the scale dimension. Hence, they become sensitive to scale variations and are unable to generalize to unseen scales. In this work, we define an alternative feature representation based on the Riesz transform. We detail and analyze the mathematical foundations behind this representation. In particular, it inherits scale equivariance from the Riesz transform and completely avoids sampling of the scale dimension. Additionally, the number of features in the representation is reduced by a factor of four compared to scattering networks. Nevertheless, our representation performs comparably well for texture classification, with an interesting addition: scale equivariance. Our method yields superior performance when dealing with scales outside of those covered by the training dataset. The usefulness of the equivariance property is demonstrated on a digit classification task, where accuracy remains stable even for scales four times larger than the one chosen for training. As a second example, we consider the classification of textures.
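For concreteness, a minimal sketch of the first-order Riesz transform of a 2D image, computed in the Fourier domain via its multiplier $-i\,\xi_j/|\xi|$ (this shows only the underlying transform, not the paper's feature network):

```python
# First-order Riesz transform of a 2D image via the FFT:
# (R_j f)^(xi) = -i * xi_j / |xi| * f^(xi).
import numpy as np

def riesz_transform(img: np.ndarray):
    """Return the two components (R_1 f, R_2 f) of the Riesz transform."""
    fy = np.fft.fftfreq(img.shape[0])[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(img.shape[1])[None, :]  # horizontal frequencies
    norm = np.sqrt(fx**2 + fy**2)
    norm[0, 0] = 1.0                  # avoid division by zero at the DC term
    spec = np.fft.fft2(img)
    r1 = np.real(np.fft.ifft2(-1j * fx / norm * spec))
    r2 = np.real(np.fft.ifft2(-1j * fy / norm * spec))
    return r1, r2
```

Because the multiplier $-i\,\xi_j/|\xi|$ is homogeneous of degree zero, dilating the image commutes with this operator, which is the source of the scale equivariance exploited above.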

The collection of ecological data in the field is essential to diagnose, monitor, and manage ecosystems in a sustainable way. Since acquiring this information through traditional methods is generally time-consuming, and automated systems can record large volumes of data in short time periods, automation of data acquisition is a growing trend. Terrestrial laser scanners (TLS), particularly LiDAR sensors, have been used in ecology to reconstruct the 3D structure of vegetation and thus infer ecosystem characteristics from the spatial variation of point density. However, the low amount of information obtained per beam, the lack of data analysis tools, and the high cost of the equipment limit their use. To address this, a low-cost TLS (<$10k) was developed, along with data acquisition and processing mechanisms, and applied in two case studies: an urban garden and a target area for ecological restoration. The orientation of the LiDAR was modified to make observations in the vertical plane, and a motor was integrated for its rotation, enabling the acquisition of 360-degree data with high resolution. Motion and location sensors were also integrated for automatic error correction and georeferencing. From the data generated, histograms of the variation of point density along the vegetation height were created, in which the shrub stratum was easily distinguishable from the tree stratum, and maximum tree height and shrub cover were calculated. These results agreed with the field data, indicating that the developed TLS is effective in calculating metrics of the structural complexity of vegetation.
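A minimal sketch of the histogram analysis described above, for an (N, 3) point cloud; the bin width, the 2 m shrub/tree boundary, and the 1 m grid are illustrative assumptions rather than the authors' processing pipeline:

```python
# Vertical point-density profile and simple structure metrics from a
# TLS point cloud stored as an (N, 3) array of x, y, z coordinates.
import numpy as np

def vertical_profile(points: np.ndarray, bin_width: float = 0.25):
    """Histogram of point counts along height (z) above the lowest point."""
    z = points[:, 2] - points[:, 2].min()
    bins = np.arange(0.0, z.max() + bin_width, bin_width)
    return np.histogram(z, bins=bins)

def simple_metrics(points: np.ndarray, shrub_max_height: float = 2.0):
    """Maximum tree height and a crude shrub-cover proxy."""
    z = points[:, 2] - points[:, 2].min()
    max_tree_height = z.max()
    # Fraction of 1 m x 1 m ground cells containing shrub-stratum points.
    xy = np.floor(points[:, :2]).astype(int)
    shrub_cells = {tuple(c) for c in xy[z <= shrub_max_height]}
    all_cells = {tuple(c) for c in xy}
    shrub_cover = len(shrub_cells) / len(all_cells)
    return max_tree_height, shrub_cover
```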

This paper proposes a new approach to fitting a linear regression for symbolic interval-valued variables, which improves both the Center Method suggested by Billard and Diday in \cite{BillardDiday2000} and the Center and Range Method suggested by Lima-Neto, E.A. and De Carvalho, F.A.T. in \cite{Lima2008, Lima2010}. As in the Center Method and the Center and Range Method, the new methods fit the linear regression model on the midpoints and on the half-lengths of the intervals (ranges) assumed by the predictor variables in the training data set as additional variables; however, these fits are carried out with the shrinkage methods Ridge Regression, Lasso, and Elastic Net proposed by Tibshirani, R., Hastie, T., and Zou, H. in \cite{Tib1996, HastieZou2005}. The lower and upper bounds of the interval response (dependent) variable are predicted from their midpoints and ranges, which are estimated from the shrinkage linear regression models fitted to the midpoints and ranges of the interval-valued predictors. The methods presented in this document are applied to three real data sets (the Cardiological, Prostate, and US Murder interval data sets) to compare their performance and ease of interpretation against the Center Method and the Center and Range Method. For this evaluation, the root-mean-squared error and the correlation coefficient are used. In addition, the reader may apply all the methods presented herein and verify the results using the {\tt RSDA} package, written in the {\tt R} language, which can be downloaded and installed directly from {\tt CRAN} \cite{Rod2014}.
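A minimal sketch of a center-and-range fit with shrinkage, using scikit-learn's Ridge (Lasso or ElasticNet swap in the same way); variable names are illustrative and this is not the {\tt RSDA} implementation:

```python
# Interval regression: fit two shrinkage models, one for midpoints and
# one for ranges, each using both midpoints and ranges as predictors.
import numpy as np
from sklearn.linear_model import Ridge

def fit_interval_ridge(X_low, X_up, y_low, y_up, alpha=1.0):
    """X_* are (n_samples, n_features) interval bounds; y_* are bounds."""
    X_mid, X_rng = (X_low + X_up) / 2, (X_up - X_low) / 2
    y_mid, y_rng = (y_low + y_up) / 2, (y_up - y_low) / 2
    X = np.hstack([X_mid, X_rng])
    mid_model = Ridge(alpha=alpha).fit(X, y_mid)
    rng_model = Ridge(alpha=alpha).fit(X, y_rng)
    return mid_model, rng_model

def predict_interval(mid_model, rng_model, X_low, X_up):
    """Recover lower/upper bounds from predicted midpoint and range."""
    X = np.hstack([(X_low + X_up) / 2, (X_up - X_low) / 2])
    mid = mid_model.predict(X)
    rng = np.maximum(rng_model.predict(X), 0.0)  # ranges must be nonnegative
    return mid - rng, mid + rng
```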

The ability to accurately approximate trajectories of dynamical systems enables their analysis, prediction, and control. Neural network (NN)-based approximations have attracted significant interest due to fast evaluation with good accuracy over long integration time steps. In contrast to established numerical approximation schemes such as Runge-Kutta methods, the estimation of the error of the NN-based approximations proves to be difficult. In this work, we propose to use the NN's predictions in a high-order implicit Runge-Kutta (IRK) method. The residuals in the implicit system of equations can be related to the NN's prediction error; hence, we can provide an error estimate at several points along a trajectory. We find that this error estimate highly correlates with the NN's prediction error and that increasing the order of the IRK method improves this estimate. We demonstrate this estimation methodology for Physics-Informed Neural Networks (PINNs) on the logistic equation as an illustrative example and then apply it to a four-state electric generator model that is regularly used in power system modelling.
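To illustrate the idea, here is a minimal sketch using the implicit midpoint rule, the simplest Gauss-Legendre IRK method (the paper uses higher-order IRK schemes; `f` and the reference step below are placeholders for the true right-hand side and a trained network's prediction):

```python
# Residual of the implicit midpoint equations, evaluated at a candidate
# next state: for the exact solution this residual is O(h^3), so a large
# residual flags error in the NN's one-step prediction.
import numpy as np

def irk_residual(f, x_t: np.ndarray, x_pred: np.ndarray, h: float) -> float:
    """Norm of x_pred - x_t - h * f((x_t + x_pred) / 2)."""
    r = x_pred - x_t - h * f((x_t + x_pred) / 2.0)
    return float(np.linalg.norm(r))

# Example: the logistic equation dx/dt = x (1 - x).
f = lambda x: x * (1.0 - x)
x_t, h = np.array([0.2]), 0.1
x_ref = x_t + h * f(x_t + 0.5 * h * f(x_t))     # cheap second-order step
print(irk_residual(f, x_t, x_ref, h))           # small: accurate prediction
print(irk_residual(f, x_t, x_ref + 0.05, h))    # larger: flags the error
```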

Machine translation often suffers from biased data and algorithms that can lead to unacceptable errors in system output. While bias in gender norms has been investigated, less is known about whether MT systems encode bias about social relationships, e.g. sentences such as "the lawyer kissed her wife." We investigate the degree of bias against same-gender relationships in MT systems, using generated template sentences drawn from several noun-gender languages (e.g. Spanish). We find that three popular MT services consistently fail to accurately translate sentences concerning relationships between nouns of the same gender. The error rate varies considerably based on the context, e.g. same-gender sentences referencing high female-representation occupations are translated with lower accuracy. We provide this work as a case study in the evaluation of intrinsic bias in NLP systems, with respect to social relationships.
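A minimal sketch of template-based probe generation in the spirit of this evaluation; the occupation and relationship word lists are illustrative assumptions, not the paper's released templates:

```python
# Generate English source sentences probing same- vs. different-gender
# relationships, to be fed to MT systems and checked for translation
# accuracy into noun-gender target languages.
from itertools import product

OCCUPATIONS = ["lawyer", "nurse", "doctor", "teacher"]
PRONOUNS = {"female": "her", "male": "his"}    # subject's gender
PARTNERS = {"female": "wife", "male": "husband"}  # partner's gender

def make_probes():
    """Yield (sentence, relationship_type) pairs for MT evaluation."""
    for occ, subj, obj in product(OCCUPATIONS, PRONOUNS, PARTNERS):
        kind = "same-gender" if subj == obj else "different-gender"
        sentence = f"The {occ} kissed {PRONOUNS[subj]} {PARTNERS[obj]}."
        yield sentence, kind

for sentence, kind in make_probes():
    print(kind, "|", sentence)
```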

Regularization promotes well-posedness in solving an inverse problem with incomplete measurement data. The regularization term is typically designed based on a priori characterization of the unknown signal, such as sparsity or smoothness. The standard inhomogeneous regularization incorporates a spatially changing exponent $p$ of the standard $\ell_p$ norm-based regularization to recover a signal whose characteristic varies spatially. This study proposes a weighted inhomogeneous regularization that extends the standard inhomogeneous regularization through a new exponent design and spatially varying weights. The new exponent design avoids misclassification when regions with different characteristics lie close to each other. The weights handle a second issue, which arises when the region of one characteristic is too small to be recovered effectively by the $\ell_p$ norm-based regularization even after being identified correctly. A suite of numerical tests shows the efficacy of the proposed weighted inhomogeneous regularization, including synthetic image experiments and real sea ice recovery from its incomplete wave measurements.
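A schematic form of such an objective may help fix ideas; the exact functional in the paper may differ, so this should be read as a sketch of the general structure:

```latex
% Weighted inhomogeneous regularization (schematic): measurement operator
% $A$, data $b$, spatially varying exponents $p_i \in [1,2]$, weights $w_i > 0$.
\begin{equation*}
  \min_{x} \; \tfrac{1}{2}\,\|Ax - b\|_2^2
  \;+\; \lambda \sum_{i} w_i \, |x_i|^{p_i},
\end{equation*}
% where $p_i$ is chosen from the local characterization of the signal
% (small $p_i$ where the signal is sparse, $p_i \approx 2$ where it is
% smooth) and $w_i$ boosts the penalty on small regions that a uniform
% $\ell_p$ penalty would otherwise fail to recover.
```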

Emotions are integral to human social interactions, with diverse responses elicited by various situational contexts. In particular, the prevalence of negative emotional states has been correlated with negative outcomes for mental health, necessitating a comprehensive analysis of their occurrence and impact on individuals. In this paper, we introduce a novel dataset named DepressionEmo, designed to detect 8 emotions associated with depression and consisting of 6037 long Reddit user posts. This dataset was created through a majority vote over zero-shot classifications from pre-trained models, with its quality validated by human annotators and ChatGPT, exhibiting an acceptable level of interrater reliability between annotators. We analyze the correlations between emotions, their distribution over time, and the linguistic characteristics of DepressionEmo. In addition, we evaluate several text classification methods from two groups: machine learning methods such as SVM, XGBoost, and LightGBM; and deep learning methods such as BERT, GAN-BERT, and BART. The pretrained BART model, bart-base, achieves the highest F1-Macro of 0.76, outperforming the other methods evaluated in our analysis. Across all emotions, the highest F1-Macro value is achieved for suicide intent, indicating the value of our dataset for identifying emotions in individuals with depression symptoms through text analysis. The curated dataset is publicly available at: //github.com/abuBakarSiddiqurRahman/DepressionEmo.
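A minimal sketch of multi-label fine-tuning of bart-base in the spirit of the strongest baseline above, assuming the Hugging Face transformers API; the example post and label assignment are placeholders, not the released DepressionEmo interface:

```python
# Multi-label emotion classification with bart-base: one logit per emotion,
# trained with BCE-with-logits via problem_type="multi_label_classification".
import torch
from transformers import BartForSequenceClassification, BartTokenizerFast

NUM_EMOTIONS = 8
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForSequenceClassification.from_pretrained(
    "facebook/bart-base",
    num_labels=NUM_EMOTIONS,
    problem_type="multi_label_classification",
)

post = "I feel hopeless and I can't sleep anymore."   # hypothetical example
labels = torch.zeros(1, NUM_EMOTIONS)                 # multi-hot vector
labels[0, 0] = 1.0                                    # mark one emotion

inputs = tokenizer(post, return_tensors="pt", truncation=True)
loss = model(**inputs, labels=labels).loss   # ready for an optimizer step
```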

Large language models have exhibited robust performance across diverse natural language processing tasks. This report introduces TechGPT-2.0, a project designed to enhance the capabilities of large language models specifically for knowledge graph construction tasks, including named entity recognition (NER) and relationship triple extraction (RTE), in NLP applications. It also serves as an LLM accessible for research within the Chinese open-source model community. We offer two 7B large language model weights and a QLoRA weight specialized for processing lengthy texts. Notably, TechGPT-2.0 is trained on Huawei's Ascend servers. Inheriting all functionalities from TechGPT-1.0, it exhibits robust text processing capabilities, particularly in the domains of medicine and law. Furthermore, we introduce new capabilities to the model, enabling it to process texts in various domains such as geographical areas, transportation, organizations, literary works, biology, natural sciences, astronomical objects, and architecture. These enhancements also strengthen the model's ability to handle hallucinations, unanswerable queries, and lengthy texts. This report provides a comprehensive and detailed introduction to the full fine-tuning process on Huawei's Ascend servers, covering experiences in Ascend server debugging, instruction fine-tuning data processing, and model training. Our code is available at //github.com/neukg/TechGPT-2.0
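For readers unfamiliar with QLoRA, here is a minimal sketch of QLoRA-style adapter training using Hugging Face peft and bitsandbytes; the base model name and LoRA hyperparameters are illustrative assumptions, not TechGPT-2.0's released configuration:

```python
# QLoRA sketch: load a 7B base model in 4-bit, then train only low-rank
# adapters on top of the frozen quantized weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",     # placeholder 7B base model
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the adapter weights are trained
```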

Confounding remains one of the major challenges to causal inference with observational data. This problem is paramount in medicine, where we would like to answer causal questions from large observational datasets like electronic health records (EHRs) and administrative claims. Modern medical data typically contain tens of thousands of covariates. Such a large set carries hope that many of the confounders are directly measured, and further hope that others are indirectly measured through their correlation with measured covariates. How can we exploit these large sets of covariates for causal inference? To help answer this question, this paper examines the performance of the large-scale propensity score (LSPS) approach on causal analysis of medical data. We demonstrate that LSPS may adjust for indirectly measured confounders by including tens of thousands of covariates that may be correlated with them. We present conditions under which LSPS removes bias due to indirectly measured confounders, and we show that LSPS may avoid bias when inadvertently adjusting for variables (like colliders) that otherwise can induce bias. We demonstrate the performance of LSPS with both simulated medical data and real medical data.
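A minimal sketch of the LSPS idea under common assumptions (an L1-regularized propensity model over all covariates, followed by stratification on the estimated score); the regularization strength and stratum count are illustrative, not the paper's specification:

```python
# Large-scale propensity score: regularized logistic regression over many
# covariates, then stratification by predicted treatment probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

def lsps_strata(X: np.ndarray, treated: np.ndarray, n_strata: int = 5):
    """Fit a propensity model on all covariates and assign score strata."""
    model = LogisticRegression(
        penalty="l1", solver="saga", C=0.1, max_iter=5000
    ).fit(X, treated)
    scores = model.predict_proba(X)[:, 1]          # estimated propensities
    edges = np.quantile(scores, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, scores) - 1, 0, n_strata - 1)
    return scores, strata   # compare treated/untreated outcomes per stratum
```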

Large language models (LLMs) are a class of artificial intelligence models based on deep learning that achieve strong performance in various tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled data using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we present a summary of the prominent large language models used in natural language processing, such as BERT and GPT, and focus on the applications of large language models at different omics levels in bioinformatics, including applications in genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Finally, the review summarizes the potential and prospects of large language models in solving bioinformatics problems.
