Neural Code Completion Tools (NCCTs) have reshaped the field of software development, which accurately suggest contextually-relevant code snippets benefiting from language modeling techniques. However, language models may emit the training data verbatim during inference with appropriate prompts. This memorization property raises privacy concerns of commercial NCCTs about the hard-coded credential leakage, leading to unauthorized access to systems. Therefore, to answer whether NCCTs will inadvertently emit the hard-coded credential, we propose an evaluation tool called Hard-coded Credential Revealer (HCR). HCR effectively constructs test prompts from GitHub code files with credentials to trigger memorization phenomenon of commercial NCCTs. Then, HCR extracts credentials with pre-defined format from the responses by four designed filters. We apply HCR to evaluate two representative commercial NCCTs: GitHub Copilot and Amazon CodeWhisperer and successfully extracted 2,702 hard-coded credentials from Copilot and 129 secrets from CodeWhisper under the black-box setting, among which at least 3.6% and 5.4% secrets are real strings from GitHub repositories. Moreover, two operational credentials were identified. The experimental results raise the severe privacy concern of the potential leakage of hard-coded credentials in the training data of commercial NCCTs.
With the development of trustworthy Federated Learning (FL), the requirement of implementing right to be forgotten gives rise to the area of Federated Unlearning (FU). Comparing to machine unlearning, a major challenge of FU lies in the decentralized and privacy-preserving nature of FL, in which clients jointly train a global model without sharing their raw data, making it substantially more intricate to selectively unlearn specific information. In that regard, many efforts have been made to tackle the challenges of FU and have achieved significant progress. In this paper, we present a comprehensive survey of FU. Specially, we provide the existing algorithms, objectives, evaluation metrics, and identify some challenges of FU. By reviewing and comparing some studies, we summarize them into a taxonomy for various schemes, potential applications and future directions.
Large language models (LLMs) have great potential for synthetic data generation. This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by LLMs: for problems with structured outputs, it is possible to prompt an LLM to perform the task in the reverse direction, by generating plausible input text for a target output structure. Leveraging this asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. We demonstrate the effectiveness of this approach on closed information extraction, where collecting ground-truth data is challenging, and no satisfactory dataset exists to date. We synthetically generate a dataset of 1.8M data points, establish its superior quality compared to existing datasets in a human evaluation, and use it to finetune small models (220M and 770M parameters), termed SynthIE, that outperform the prior state of the art (with equal model size) by a substantial margin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data, and models are available at //github.com/epfl-dlab/SynthIE.
Deep learning has revolutionized various real-world applications, but the quality of Deep Neural Networks (DNNs) remains a concern. DNNs are complex and have millions of parameters, making it difficult to determine their contributions to fulfilling a task. Moreover, the behavior of a DNN is highly influenced by the data used during training, making it challenging to collect enough data to exercise all potential DNN behavior under all possible scenarios. This paper proposes a novel NP-SBFL method that adapts spectrum-based fault localization (SBFL) to locate faulty neural pathways. Our method identifies critical neurons using the layer-wise relevance propagation (LRP) technique and determines which critical neurons are faulty. We propose a multi-stage gradient ascent (MGA), an extension of gradient ascent, to effectively activate a sequence of neurons one at a time while maintaining the activation of previous neurons. We evaluated the effectiveness of our method on two commonly used datasets, MNIST and CIFAR-10, two baselines DeepFault and NP-SBFL-GA, and three suspicious neuron measures, Tarantula, Ochiai, and Barinel. The empirical results showed that NP-SBFL-MGA is statistically more effective than the baselines at identifying suspicious paths and synthesizing adversarial inputs. Particularly, Tarantula on NP-SBFL-MGA had the highest fault detection rate at 96.75%, surpassing DeepFault on Ochiai (89.90%) and NP-SBFL-GA on Ochiai (60.61%). Our approach also yielded comparable results to the baselines in synthesizing naturalness inputs, and we found a positive correlation between the coverage of critical paths and the number of failed tests in DNN fault localization.
Large language models (LLMs), like ChatGPT, have shown some human-like cognitive abilities. For comparing these abilities of different models, several benchmarks (i.e. sets of standard test questions) from different fields (e.g., Literature, Biology and Psychology) are often adopted and the test results under traditional metrics such as accuracy, recall and F1, are reported. However, such way for evaluating LLMs can be inefficient and inaccurate from the cognitive science perspective. Inspired by Computerized Adaptive Testing (CAT) used in psychometrics, we propose an adaptive testing framework for LLM evaluation. Rather than using a standard test set and simply reporting accuracy, this approach dynamically adjusts the characteristics of the test questions, such as difficulty, based on the model's performance. This allows for a more accurate estimation of the model's abilities, using fewer questions. More importantly, it allows LLMs to be compared with humans easily, which is essential for NLP models that aim for human-level ability. Our diagnostic reports have found that ChatGPT often behaves like a ``careless student'', prone to slip and occasionally guessing the questions. We conduct a fine-grained diagnosis and rank the latest 6 instruction-tuned LLMs from three aspects of Subject Knowledge, Mathematical Reasoning, and Programming, where GPT4 can outperform other models significantly and reach the cognitive ability of middle-level students. Different tests for different models using efficient adaptive testing -- we believe this has the potential to become a new norm in evaluating large language models.
Generative Artificial Intelligence (GenAI) tools have become increasingly prevalent in software development, offering assistance to various managerial and technical project activities. Notable examples of these tools include OpenAIs ChatGPT, GitHub Copilot, and Amazon CodeWhisperer. Although many recent publications have explored and evaluated the application of GenAI, a comprehensive understanding of the current development, applications, limitations, and open challenges remains unclear to many. Particularly, we do not have an overall picture of the current state of GenAI technology in practical software engineering usage scenarios. We conducted a literature review and focus groups for a duration of five months to develop a research agenda on GenAI for Software Engineering. We identified 78 open Research Questions (RQs) in 11 areas of Software Engineering. Our results show that it is possible to explore the adoption of GenAI in partial automation and support decision-making in all software development activities. While the current literature is skewed toward software implementation, quality assurance and software maintenance, other areas, such as requirements engineering, software design, and software engineering education, would need further research attention. Common considerations when implementing GenAI include industry-level assessment, dependability and accuracy, data accessibility, transparency, and sustainability aspects associated with the technology. GenAI is bringing significant changes to the field of software engineering. Nevertheless, the state of research on the topic still remains immature. We believe that this research agenda holds significance and practical value for informing both researchers and practitioners about current applications and guiding future research.
The emerging trend of AR/VR places great demands on 3D content. However, most existing software requires expertise and is difficult for novice users to use. In this paper, we aim to create sketch-based modeling tools for user-friendly 3D modeling. We introduce Reality3DSketch with a novel application of an immersive 3D modeling experience, in which a user can capture the surrounding scene using a monocular RGB camera and can draw a single sketch of an object in the real-time reconstructed 3D scene. A 3D object is generated and placed in the desired location, enabled by our novel neural network with the input of a single sketch. Our neural network can predict the pose of a drawing and can turn a single sketch into a 3D model with view and structural awareness, which addresses the challenge of sparse sketch input and view ambiguity. We conducted extensive experiments synthetic and real-world datasets and achieved state-of-the-art (SOTA) results in both sketch view estimation and 3D modeling performance. According to our user study, our method of performing 3D modeling in a scene is $>$5x faster than conventional methods. Users are also more satisfied with the generated 3D model than the results of existing methods.
A major challenge in the practical use of Machine Translation (MT) is that users lack guidance to make informed decisions about when to rely on outputs. Progress in quality estimation research provides techniques to automatically assess MT quality, but these techniques have primarily been evaluated in vitro by comparison against human judgments outside of a specific context of use. This paper evaluates quality estimation feedback in vivo with a human study simulating decision-making in high-stakes medical settings. Using Emergency Department discharge instructions, we study how interventions based on quality estimation versus backtranslation assist physicians in deciding whether to show MT outputs to a patient. We find that quality estimation improves appropriate reliance on MT, but backtranslation helps physicians detect more clinically harmful errors that QE alone often misses.
The Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks with different data modalities. A pretrained foundation model, such as BERT, GPT-3, MAE, DALLE-E, and ChatGPT, is trained on large-scale data which provides a reasonable parameter initialization for a wide range of downstream applications. The idea of pretraining behind PFMs plays an important role in the application of large models. Different from previous methods that apply convolution and recurrent modules for feature extractions, the generative pre-training (GPT) method applies Transformer as the feature extractor and is trained on large datasets with an autoregressive paradigm. Similarly, the BERT apples transformers to train on large datasets as a contextual language model. Recently, the ChatGPT shows promising success on large language models, which applies an autoregressive language model with zero shot or few show prompting. With the extraordinary success of PFMs, AI has made waves in a variety of fields over the past few years. Considerable methods, datasets, and evaluation metrics have been proposed in the literature, the need is raising for an updated survey. This study provides a comprehensive review of recent research advancements, current and future challenges, and opportunities for PFMs in text, image, graph, as well as other data modalities. We first review the basic components and existing pretraining in natural language processing, computer vision, and graph learning. We then discuss other advanced PFMs for other data modalities and unified PFMs considering the data quality and quantity. Besides, we discuss relevant research about the fundamentals of the PFM, including model efficiency and compression, security, and privacy. Finally, we lay out key implications, future research directions, challenges, and open problems.
In pace with developments in the research field of artificial intelligence, knowledge graphs (KGs) have attracted a surge of interest from both academia and industry. As a representation of semantic relations between entities, KGs have proven to be particularly relevant for natural language processing (NLP), experiencing a rapid spread and wide adoption within recent years. Given the increasing amount of research work in this area, several KG-related approaches have been surveyed in the NLP research community. However, a comprehensive study that categorizes established topics and reviews the maturity of individual research streams remains absent to this day. Contributing to closing this gap, we systematically analyzed 507 papers from the literature on KGs in NLP. Our survey encompasses a multifaceted review of tasks, research types, and contributions. As a result, we present a structured overview of the research landscape, provide a taxonomy of tasks, summarize our findings, and highlight directions for future work.
Graph Neural Networks (GNNs) have been studied from the lens of expressive power and generalization. However, their optimization properties are less well understood. We take the first step towards analyzing GNN training by studying the gradient dynamics of GNNs. First, we analyze linearized GNNs and prove that despite the non-convexity of training, convergence to a global minimum at a linear rate is guaranteed under mild assumptions that we validate on real-world graphs. Second, we study what may affect the GNNs' training speed. Our results show that the training of GNNs is implicitly accelerated by skip connections, more depth, and/or a good label distribution. Empirical results confirm that our theoretical results for linearized GNNs align with the training behavior of nonlinear GNNs. Our results provide the first theoretical support for the success of GNNs with skip connections in terms of optimization, and suggest that deep GNNs with skip connections would be promising in practice.