Large language models have shown remarkable aptitude in code generation, but still struggle on challenging tasks. Self-repair -- in which the model debugs and fixes mistakes in its own code -- has recently become a popular way to boost performance in these settings. However, the literature contains only limited studies of how and when self-repair works effectively, and one might wonder to what extent a model is really capable of repairing mistakes in code that was originally generated by that very same model. In this paper, we analyze the ability of Code Llama, GPT-3.5, and GPT-4 to perform self-repair on problems taken from HumanEval and APPS, finding that when the cost of carrying out repair is taken into account, gains are often modest, vary significantly between subsets of the data, and are sometimes not present at all. We hypothesize that this is because self-repair is bottlenecked by the model's ability to provide feedback on its own code; boosting the feedback with stronger models, we observe performance gains even in settings where the model does not benefit from self-repair. Finally, we find that providing the model with feedback from human participants greatly benefits repair even for GPT-4, and we carry out a brief qualitative analysis of the differences observed.
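A minimal sketch of the kind of self-repair loop described above, assuming hypothetical helpers `generate_code`, `run_tests`, `generate_feedback`, and `repair_code` that wrap an LLM API and a unit-test harness; it is illustrative only and not the paper's implementation.

```python
# Illustrative self-repair loop (not the paper's implementation).
# `generate_code`, `run_tests`, `generate_feedback`, and `repair_code` are
# hypothetical wrappers around an LLM API and a unit-test harness.

def self_repair(problem, max_repairs=1):
    program = generate_code(problem)              # initial sample from the model
    passed, errors = run_tests(problem, program)
    for _ in range(max_repairs):
        if passed:
            break
        # Feedback may come from the same model, a stronger model, or a human.
        feedback = generate_feedback(problem, program, errors)
        # The model produces a revised program conditioned on the feedback.
        program = repair_code(problem, program, feedback)
        passed, errors = run_tests(problem, program)
    return program, passed
```

The cost accounting mentioned above would count both the initial sample and every feedback/repair call against the same sampling budget.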
Large language models (LLMs) offer a range of new possibilities, including adapting text to different audiences and their reading needs. But how well do they adapt? We evaluate the readability of answers generated by four state-of-the-art LLMs (commercial and open-source) to science questions when prompted to target different age groups and education levels. To assess the adaptability of LLMs to diverse audiences, we compare the readability scores of the generated responses against the recommended comprehension level of each age and education group. We find large variations in the readability of the answers produced by different LLMs. Our results suggest that LLM answers need to be better adapted to the intended audience demographics to be more comprehensible, and they underline the importance of enhancing the adaptability of LLMs in educational settings to cater to diverse age and education levels. Overall, current LLMs produce answers within fixed readability ranges and do not adapt well to different audiences, even when prompted, which limits their potential for educational purposes.
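To make the comparison concrete, here is a small sketch of the kind of readability check involved, scoring a generated answer with the Flesch-Kincaid grade level via the `textstat` package and comparing it against a target grade band; the helper name and tolerance are our own illustrative assumptions, not the study's exact protocol.

```python
# Illustrative readability check (not the study's exact protocol).
# Requires: pip install textstat
import textstat

def matches_target_grade(answer: str, target_grade: int, tolerance: float = 1.5) -> bool:
    """True if the Flesch-Kincaid grade level of `answer` falls within
    `tolerance` grades of the intended audience's comprehension level."""
    grade = textstat.flesch_kincaid_grade(answer)
    return abs(grade - target_grade) <= tolerance

answer = "Plants use sunlight, water and air to make their own food."
print(matches_target_grade(answer, target_grade=5))  # aimed at ~5th-grade readers
```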
The success of researchers is now mostly measured using the Hirsch index ($h$). Our recent precise demonstration that statistically $h \sim \sqrt{N_c} \sim \sqrt{N_p}$, where $N_p$ and $N_c$ denote respectively the total number of publications and total citations for the researcher, suggests that the average number of citations per paper ($N_c/N_p$), and hence $h$, are statistical numbers (Dunbar numbers) depending on the community or network to which the researcher belongs. We show here, extending our earlier observations, that success is reflected not by the total citations $N_c$ but by the inequality of citations across publications. Specifically, we show that for very successful authors, the yearly values of the Gini index ($g$, giving the average inequality of citations across publications) and the Kolkata index ($k$, giving the fraction of total citations received by the top $1 - k$ fraction of publications; $k = 0.80$ corresponds to Pareto's 80/20 law) approach each other, reaching $g = k \simeq 0.82$, signaling a precursor of the arrival at (or departure from) a Self-Organized Critical (SOC) state of their publication statistics. Analyzing the citation statistics (from Google Scholar) of thirty successful scientists throughout their recorded publication history, we find that $g$ and $k$ for the very successful among them (mostly Nobel Laureates, highest-ranked Stanford Cite-Scorers, and a few others) reach and hover just above (and then below) the $g = k \simeq 0.82$ mark, while for the others they remain below it. We also find that all values of $k$ and $g$ below the SOC mark of 0.82 fit a linear relationship $k = 1/2 + c\,g$, with $c = 0.39$.
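A small sketch, following the definitions above, of how $g$ and $k$ can be computed from a list of per-paper citation counts; the function name and toy data are ours, and the code is illustrative rather than the authors' analysis pipeline.

```python
# Illustrative computation of the Gini index g and Kolkata index k from
# per-paper citation counts (not the authors' analysis code).
import numpy as np

def gini_and_kolkata(citations):
    c = np.sort(np.asarray(citations, dtype=float))  # citations per paper, ascending
    n = c.size
    i = np.arange(1, n + 1)
    # Gini index g via the standard formula for data sorted in ascending order.
    g = 2.0 * np.sum(i * c) / (n * c.sum()) - (n + 1) / n
    # Kolkata index k: the point where the top (1 - k) fraction of papers
    # holds a fraction k of all citations, i.e. where frac + cum = 1.
    cum = np.cumsum(c) / c.sum()   # cumulative citation share (Lorenz curve)
    frac = i / n                   # cumulative fraction of papers
    k = frac[np.argmin(np.abs(frac + cum - 1.0))]
    return g, k

g, k = gini_and_kolkata([120, 45, 30, 8, 5, 2, 1, 0])
print(round(g, 2), round(k, 2))  # highly unequal toy profile -> high g and k
```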
Pre-trained large language models have significantly improved code generation. As these models scale up, there is an increasing need for their output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics because of the amount of specialized domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate large language models (LLMs) in generating bioinformatics-specific code. BioCoder spans a broad spectrum of the field and covers cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate many models, including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned StarCoder, demonstrating how our dataset can effectively enhance the performance of LLMs on our benchmark (by more than 15% in terms of Pass@K in certain prompt configurations, and always by more than 3%). The results highlight two key aspects of successful models: (1) they accommodate a long prompt (greater than ~2600 tokens) with full context, capturing functional dependencies, and (2) they contain specific domain knowledge of bioinformatics beyond just general coding knowledge. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on the benchmark (50% vs. up to ~25%). Our dataset, benchmark, Docker images, and scripts required for testing are all available at //github.com/gersteinlab/biocoder.
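For reference, the Pass@K numbers quoted above are conventionally computed with the standard unbiased estimator popularized by the HumanEval evaluation; the snippet below shows that estimator for illustration and is not BioCoder's own harness code.

```python
# Standard unbiased Pass@K estimator (illustrative; not BioCoder's harness).
# n = samples generated per problem, c = samples that pass the tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k randomly drawn samples
    (out of n generated, of which c are correct) passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=20, c=3, k=5))  # e.g. Pass@5 with 3 of 20 samples correct
```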
Recent advancements in large language models, such as ChatGPT, have demonstrated significant potential to impact various aspects of human life. However, ChatGPT still faces challenges in providing reliable and accurate answers to user questions. To better understand the model's particular weaknesses in providing truthful answers, we embark on an in-depth exploration of open-domain question answering. Specifically, we undertake a detailed examination of ChatGPT's failures, categorized into four types: comprehension, factuality, specificity, and inference. We further pinpoint factuality as the failure type contributing most to errors and identify two critical abilities associated with factuality: knowledge memorization and knowledge recall. Through experiments focusing on factuality, we propose several potential enhancement strategies. Our findings suggest that augmenting the model with granular external knowledge and with cues for knowledge recall can enhance the model's factuality in answering questions.
Fluency is a crucial goal of all Natural Language Generation (NLG) systems, yet widely used automatic evaluation metrics fall short in capturing the fluency of machine-generated text. Assessing the fluency of NLG systems poses a challenge because these models are not limited to simply reusing words from the input but may also generate abstractions. Existing reference-based fluency evaluations, such as word-overlap measures, often exhibit weak correlations with human judgments. This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference. Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures. We also experiment with other available multilingual language models (LMs). To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages, correlating the obtained fluency scores with human judgments. Our code and human-annotated benchmark test set for fluency are available at //github.com/AnanyaCoder/TextFluencyForIndicLanaguges.
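As a rough illustration of reference-free fluency scoring with a pretrained LM, the sketch below scores a sentence by its perplexity using the Hugging Face `transformers` API; the checkpoint (`gpt2` as a stand-in) and the perplexity-as-fluency choice are illustrative assumptions, not the paper's exact models or setup.

```python
# Illustrative reference-free fluency scoring via LM perplexity
# (stand-in checkpoint; not the paper's exact models or setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # a multilingual/Indic LM would be used in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def fluency_score(sentence: str) -> float:
    """Lower perplexity -> more fluent under the language model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()           # perplexity

print(fluency_score("The cat sat on the mat."))
```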
Every SQL statement is limited to returning a single, possibly denormalized, table. This design decision has far-reaching consequences: (1) for database users, in terms of slow query performance, long query-result transfer times, and usability issues of SQL in web applications and object-relational mappers; and (2) for database architects, when designing query optimizers, in terms of logical (algebraic) join-enumeration effort, memory consumption for intermediate-result materialization, and physical-operator selection effort. In effect, the entire query optimization stack is shaped by this design decision. In this paper, we argue that the single-table limitation should be dropped. We extend the SELECT clause of SQL with a keyword 'RESULTDB' to support returning a result database. Our approach has clear semantics: our extended SQL returns subsets of all tables containing only those tuples that would be part of the traditional (single-table) query result, but without performing any denormalization through joins. Our SQL extension is backward compatible. Moreover, we discuss the surprisingly long list of benefits of our approach. First, for database users: far simpler and more readable application code, better query performance, smaller query results, and better query-result transfer times. Second, for database architects, we show how to leverage existing closed-source systems as well as how to change open-source database systems to support our feature, and we propose several algorithms to integrate it into both. We present an initial experimental study with promising results.
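To make the contrast concrete, here is a small Python sketch (our own toy example, not code from the paper) contrasting a traditional denormalized join result with the kind of result database the RESULTDB extension would return: per-base-table subsets containing only the tuples that participate in the query result.

```python
# Toy contrast: denormalized single-table result vs. a "result database"
# of per-table subsets (our illustration, not the paper's code).
customers = [(1, "Ada"), (2, "Bob"), (3, "Cyd")]
orders    = [(10, 1, "book"), (11, 1, "pen"), (12, 2, "mug")]

# Traditional SQL join result: one table; customer rows repeated per order.
joined = [(cid, name, oid, item)
          for (cid, name) in customers
          for (oid, ocid, item) in orders if ocid == cid]
# -> [(1, 'Ada', 10, 'book'), (1, 'Ada', 11, 'pen'), (2, 'Bob', 12, 'mug')]

# Result-database style: a subset of each base table, with no duplication.
result_db = {
    "customers": [(cid, name) for (cid, name) in customers
                  if any(ocid == cid for (_, ocid, _) in orders)],
    "orders":    [row for row in orders
                  if any(cid == row[1] for (cid, _) in customers)],
}
# -> {'customers': [(1, 'Ada'), (2, 'Bob')],
#     'orders':    [(10, 1, 'book'), (11, 1, 'pen'), (12, 2, 'mug')]}
```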
Programming language understanding and representation (a.k.a. code representation learning) has long been a popular and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of source code features while preserving their semantics. These representations can then be used to facilitate subsequent code-related tasks. The abstract syntax tree (AST), a fundamental code feature, captures the syntactic structure of source code and has been widely used in code representation learning. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. In this paper, we first conduct a comprehensive empirical study to explore the effectiveness of AST-based code representation in facilitating follow-up code-related tasks. To do so, we compare the performance of models trained with code-token-sequence-based (Token for short) code representation and AST-based code representation on three popular types of code-related tasks. Surprisingly, the overall quantitative results show that models trained with AST-based code representation consistently perform worse across all three tasks than models trained with Token-based code representation. Our further quantitative analysis reveals, however, that models trained with AST-based code representation outperform Token-based models on certain subsets of samples across all three tasks. We also conduct comprehensive experiments to evaluate and reveal the impact of the choice of AST parsing, preprocessing, and encoding methods on AST-based code representation and subsequent code-related tasks. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit the AST.
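As a concrete illustration of the two input views being compared, the sketch below extracts a lexical token sequence and a linearized sequence of AST node types from the same Python snippet using the standard `tokenize` and `ast` modules; it is a simplified stand-in for the parsing/preprocessing pipelines evaluated in the study.

```python
# Illustrative Token-based vs. AST-based views of the same code
# (a simplified stand-in for the paper's preprocessing pipelines).
import ast, io, tokenize

source = "def add(a, b):\n    return a + b\n"

# Token-based representation: the raw lexical token sequence.
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.string.strip()]
# -> ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']

# AST-based representation: a linearized sequence of syntax-tree node types.
ast_nodes = [type(node).__name__ for node in ast.walk(ast.parse(source))]
# -> ['Module', 'FunctionDef', 'arguments', 'Return', 'arg', 'arg', 'BinOp', ...]
```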
Content generated by recent advanced Text-to-Image (T2I) diffusion models is sometimes too imaginative for existing off-the-shelf semantic property predictors to estimate, owing to an immitigable domain gap. We introduce DMP, a pipeline that utilizes pre-trained T2I models as a prior for pixel-level semantic prediction tasks. To address the misalignment between deterministic prediction tasks and stochastic T2I models, we reformulate the diffusion process as a sequence of interpolations, establishing a deterministic mapping between input RGB images and output prediction distributions. To preserve generalizability, we use low-rank adaptation to fine-tune the pre-trained models. Extensive experiments across five tasks, including 3D property estimation, semantic segmentation, and intrinsic image decomposition, showcase the efficacy of the proposed method. Despite limited-domain training data, our approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.
Compared with the cheap addition operation, multiplication is of much higher computational complexity. The widely used convolutions in deep neural networks are exactly cross-correlations that measure the similarity between input features and convolution filters, which involves massive multiplications between floating-point values. In this paper, we present adder networks (AdderNets) that trade these massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs. In AdderNets, we take the $\ell_1$-norm distance between filters and the input feature as the output response. The influence of this new similarity measure on the optimization of neural networks is thoroughly analyzed. To achieve better performance, we develop a special back-propagation approach for AdderNets by investigating the full-precision gradient. We then propose an adaptive learning rate strategy to enhance the training procedure of AdderNets according to the magnitude of each neuron's gradient. As a result, the proposed AdderNets achieve 74.9% Top-1 accuracy and 91.7% Top-5 accuracy using ResNet-50 on the ImageNet dataset without any multiplication in the convolution layers.
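A minimal NumPy sketch of the idea above for a single filter/patch pair: the convolution response uses multiply-accumulate, while the adder response uses only additions and absolute values via the $\ell_1$ distance (negated here so that a larger response means greater similarity); this is an illustrative re-statement of the formula, not the authors' implementation.

```python
# Illustrative convolution unit vs. AdderNet unit for one input patch X and
# one filter F of the same shape (NumPy; not the authors' implementation).
import numpy as np

def conv_response(X, F):
    # Cross-correlation: similarity measured via multiply-accumulate.
    return np.sum(X * F)

def adder_response(X, F):
    # AdderNet: similarity via the negated l1-norm distance, additions only.
    return -np.sum(np.abs(X - F))

X = np.random.randn(3, 3)
F = np.random.randn(3, 3)
print(conv_response(X, F), adder_response(X, F))
```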
Language model pre-training has proven to be useful for learning universal language representations. As a state-of-the-art pre-trained language model, BERT (Bidirectional Encoder Representations from Transformers) has achieved impressive results on many language understanding tasks. In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT for text classification and provide a general solution for BERT fine-tuning. The proposed solution obtains new state-of-the-art results on eight widely studied text classification datasets.
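For orientation, below is a minimal BERT text-classification fine-tuning sketch using the Hugging Face `transformers` Trainer API; the toy dataset and hyperparameters are placeholders of our own, not the paper's recipe or code.

```python
# Minimal BERT fine-tuning sketch for text classification (Hugging Face
# transformers/datasets; placeholder data and hyperparameters only).
from datasets import Dataset
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy dataset standing in for a real text classification corpus.
train = Dataset.from_dict({"text": ["great movie", "terrible plot"], "label": [1, 0]})
train = train.map(lambda b: tokenizer(b["text"], truncation=True,
                                      padding="max_length", max_length=64),
                  batched=True)

args = TrainingArguments(output_dir="bert-finetune", learning_rate=2e-5,
                         num_train_epochs=3, per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train).train()
```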