Sign-Perturbed Sums (SPS) is a powerful finite-sample system identification algorithm which can construct confidence regions for the true data generating system with exact coverage probabilities, for any finite sample size. SPS was developed in a series of papers and it has a wide range of applications, from general linear systems, even in a closed-loop setup, to nonlinear and nonparametric approaches. Although several theoretical properties of SPS have been proven in the literature, the sample complexity of the method has not been analysed so far. This paper aims to fill this gap and provides the first results on the sample complexity of SPS. Here, we focus on scalar linear regression problems, that is, we study the behaviour of SPS confidence intervals. We provide high probability upper bounds, under three different sets of assumptions, showing that the sizes of SPS confidence intervals shrink at a geometric rate around the true parameter, if the observation noises are subgaussian. We also show that similar bounds hold for the previously proposed outer approximation of the confidence region. Finally, we present simulation experiments comparing the theoretical and the empirical convergence rates.
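For the scalar linear regression setting of this abstract, the membership test behind an SPS confidence interval can be sketched as follows. This is a minimal illustration, assuming the model y_t = phi_t * theta + noise_t with independent noise symmetric about zero; the function names `sps_contains` and `draw_signs` are hypothetical, and ties between sums are ignored here rather than broken at random.

```python
import random

def draw_signs(m, n, rng):
    """Draw m - 1 random sign sequences of length n (hypothetical helper)."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m - 1)]

def sps_contains(theta, phi, y, signs, q):
    """Check whether a candidate theta lies in the SPS confidence set.

    phi, y : regressors and outputs of the scalar model
    signs  : m - 1 sign sequences, one per sign-perturbed sum
    q      : the set targets confidence level 1 - q / m, m = len(signs) + 1
    """
    m = len(signs) + 1
    # phi_t times the residual at theta; the reference sum adds these as-is.
    terms = [p * (yt - p * theta) for p, yt in zip(phi, y)]
    s0 = abs(sum(terms))
    # Sign-perturbed sums: the same terms with random signs applied.
    perturbed = [abs(sum(a * t for a, t in zip(alpha, terms))) for alpha in signs]
    # Rank of the reference sum among all m sums (ties ignored in this sketch).
    rank = 1 + sum(1 for s in perturbed if s < s0)
    return rank <= m - q
```

Scanning `sps_contains` over a grid of candidate values of theta gives a numerical approximation of the confidence interval. A common positive normalisation of the sums is omitted because it does not change their ranking.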
User alignment is crucial for adapting general-purpose language models (LMs) to downstream tasks, but human annotations are often not available for all types of instructions, especially those with customized constraints. We observe that user instructions typically contain constraints. While assessing response quality in terms of the whole instruction is often costly, efficiently evaluating the satisfaction rate of constraints is feasible. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute the constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and automatically collects preference labels based on their CSR. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity typing, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs' capability to adhere to different classes of constraints, thereby improving task performance. Further experiments show that the constraint-following capabilities are transferable.
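The labelling step described in this abstract (verifiers, CSR, preference labels) can be sketched as follows. This is an illustration under stated assumptions, not the ACT implementation: verifiers are modelled as plain functions returning True or False, and the pairing rule (every response with a higher CSR is preferred over every response with a lower one) is a guess, since the abstract does not say how pairs are formed. The names `csr` and `preference_pairs` are hypothetical.

```python
from itertools import combinations

def csr(response, verifiers):
    """Constraint satisfaction rate: fraction of verifiers the response passes."""
    return sum(1 for check in verifiers if check(response)) / len(verifiers)

def preference_pairs(responses, verifiers):
    """Return (preferred, rejected) pairs ordered by CSR; ties yield no pair."""
    scored = [(csr(r, verifiers), r) for r in responses]
    pairs = []
    for (score_a, a), (score_b, b) in combinations(scored, 2):
        if score_a > score_b:
            pairs.append((a, b))
        elif score_b > score_a:
            pairs.append((b, a))
    return pairs

# Two toy constraints (assumed for illustration): a length limit and a keyword.
example_verifiers = [
    lambda r: len(r.split()) <= 5,
    lambda r: "Paris" in r,
]
```

The resulting pairs would then feed a ranking-based objective. That training step is not shown here.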
Autonomous vehicles (AVs) may use external interfaces, such as LED light bands, to communicate with pedestrians safely and intuitively. While previous research has demonstrated the effectiveness of these interfaces in simple traffic scenarios involving one pedestrian and one vehicle, their performance in more complex scenarios with multiple road users remains unclear. The scalability of AV external communication has therefore attracted increasing attention, prompting the need for further investigation. This scoping review synthesises information from 54 papers to identify seven key scalability issues in multi-vehicle and multi-pedestrian environments, with Clarity of Recipients, Information Overload, and Multi-Lane Safety emerging as the most pressing concerns. To guide future research in scalable AV-pedestrian interactions, we propose high-level design directions focused on three communication loci: vehicle, infrastructure, and pedestrian. Our work contributes the groundwork and a roadmap for designing simplified, coordinated, and targeted external AV communication, ultimately improving safety and efficiency in complex traffic scenarios.
We prove a priori and a posteriori error estimates for physics-informed neural networks (PINNs) for linear PDEs. We analyze elliptic equations in primal and mixed form, elasticity, parabolic, hyperbolic and Stokes equations, as well as a PDE-constrained optimization problem. For the analysis, we propose an abstract framework in the common language of bilinear forms, and we show that coercivity and continuity lead to error estimates. The obtained estimates are sharp and reveal that the $L^2$ penalty approach for initial and boundary conditions in the PINN formulation weakens the norm of the error decay. Finally, utilizing recent advances in PINN optimization, we present numerical examples that illustrate the ability of the method to achieve accurate solutions.
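To make the role of the $L^2$ penalty concrete, consider as an illustrative assumption (not a formulation taken from the paper) the Poisson problem $-\Delta u = f$ in $\Omega$ with $u = g$ on $\partial\Omega$. A typical PINN loss for a network $u_\theta$ adds the boundary mismatch, measured in $L^2(\partial\Omega)$, to the interior residual:

$$
\mathcal{L}(u_\theta) = \| \Delta u_\theta + f \|_{L^2(\Omega)}^2 + \| u_\theta - g \|_{L^2(\partial\Omega)}^2 .
$$

The boundary term controls the trace of the error only in $L^2(\partial\Omega)$, which is weaker than the $H^{1/2}(\partial\Omega)$ norm associated with traces of $H^1(\Omega)$ functions. This is one plausible reading of why the penalty weakens the norm in which the error decays.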
Batch Normalization's (BN) unique property of depending on other samples in a batch is known to cause problems in several tasks, including sequence modeling. Yet, BN-related issues are hardly studied for long video understanding, despite the ubiquitous use of BN in CNNs (Convolutional Neural Networks) for feature extraction. Especially in surgical workflow analysis, where the lack of pretrained feature extractors has led to complex, multi-stage training pipelines, limited awareness of BN issues may have hidden the benefits of training CNNs and temporal models end to end. In this paper, we analyze pitfalls of BN in video learning, including issues specific to online tasks such as a 'cheating' effect in anticipation. We observe that BN's properties create major obstacles for end-to-end learning. However, using BN-free backbones, even simple CNN-LSTMs beat the state of the art on three surgical workflow benchmarks by utilizing adequate end-to-end training strategies which maximize temporal context. We conclude that awareness of BN's pitfalls is crucial for effective end-to-end learning in surgical tasks. By reproducing results on natural-video datasets, we hope our insights will benefit other areas of video learning as well. Code is available at: //gitlab.com/nct_tso_public/pitfalls_bn
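The batch dependence mentioned in this abstract can be seen in a few lines. The following is a minimal sketch, assuming a single scalar feature and training-mode normalization without the learned scale and shift; the function name `batch_norm_train` is hypothetical and not taken from the authors' code.

```python
from statistics import fmean, pvariance

def batch_norm_train(batch, eps=1e-5):
    """Normalize each value with the mean and variance of its own batch."""
    mean = fmean(batch)
    var = pvariance(batch, mu=mean)  # population variance over the batch
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]

# The same input value 1.0 placed in two different batches.
out_a = batch_norm_train([1.0, 2.0, 3.0])[0]
out_b = batch_norm_train([1.0, 10.0, 20.0])[0]
# out_a and out_b differ: the output for a sample depends on its batch mates.
```

When a batch holds several frames of the same video, this coupling means each frame's features carry information about the other frames in the batch.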
Large language models (LLMs) are complex artificial intelligence systems capable of understanding, generating and translating human language. They learn language patterns by analyzing large amounts of text data, allowing them to perform writing, conversation, summarizing and other language tasks. When LLMs process and generate large amounts of data, there is a risk of leaking sensitive information, which may threaten data privacy. This paper concentrates on elucidating the data privacy concerns associated with LLMs to foster a comprehensive understanding. Specifically, a thorough investigation is undertaken to delineate the spectrum of data privacy threats, encompassing both passive privacy leakage and active privacy attacks within LLMs. Subsequently, we conduct an assessment of the privacy protection mechanisms employed by LLMs at various stages, followed by a detailed examination of their efficacy and constraints. Finally, the discourse extends to delineate the challenges encountered and outline prospective directions for advancement in the realm of LLM privacy protection.
Large language models (LLMs) are demonstrating remarkable capabilities across various tasks despite lacking a foundation in human cognition. This raises the question: can these models, beyond simply mimicking human language patterns, offer insights into the mechanisms underlying human cognition? This study explores the ability of ChatGPT to predict human performance in a language-based memory task. Building upon theories of text comprehension, we hypothesize that recognizing ambiguous sentences (e.g., "Because Bill drinks wine is never kept in the house") is facilitated by preceding them with contextually relevant information. Participants, both human and ChatGPT, were presented with pairs of sentences. The second sentence was always a garden-path sentence designed to be inherently ambiguous, while the first sentence provided either a fitting (e.g., "Bill has chronic alcoholism") or an unfitting context (e.g., "Bill likes to play golf"). We measured both humans' and ChatGPT's ratings of sentence relatedness, ChatGPT's memorability ratings for the garden-path sentences, and humans' spontaneous memory for the garden-path sentences. The results revealed a striking alignment between ChatGPT's assessments and human performance. Sentences deemed more related and assessed as being more memorable by ChatGPT were indeed better remembered by humans, even though ChatGPT's internal mechanisms likely differ significantly from human cognition. This finding, which was confirmed with a robustness check employing synonyms, underscores the potential of generative AI models to predict human performance accurately. We discuss the broader implications of these findings for leveraging LLMs in the development of psychological theories and for gaining a deeper understanding of human cognition.
National Security Letters (NSLs) are similar to administrative subpoenas and can be issued directly by elements of the executive branch without requiring prior approval from a court or grand jury. Importantly, NSLs authorize the imposition of nondisclosure orders (aka "gag orders") on the receiving party. Controversy about potential abuses of this authority has driven a range of legal and policy discussions. To address these concerns, both the public sector and the private sector have sought to document the usage of NSLs in aggregated form. However, each data source is limited in scope, time, and kind. In this paper, we consolidate the available data around NSLs and answer two questions: (1) what can the public effectively learn from the reported data and does this information suffice to assess the NSL usage? (2) how accessible is this data collection? We show that longitudinal trends in the usage of NSLs can be observed. For instance, we find a significant increase in NSL requests for non-US persons and that the policy reforms to decrease the mandated nondisclosure period appear to be effective. The observed trends suggest that the current transparency mechanisms are viable safeguards against the excessive use of NSLs. However, aggregating and normalizing the data requires manual reviewing, parsing, and validating. We even find inconsistencies within and across official data sources. Overall, the laborious data collection process hinders external and internal auditing efforts and demonstrates the need for a unified and more usable dataset for NSLs.
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factual, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate at the current step and what surface form to use. Our method, Claim Conditioned Probability (CCP), measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for six different LLMs and three languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.
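The general idea of scoring atomic claims from token-level probabilities can be sketched as follows. This is explicitly not CCP, whose definition is not given in the abstract; it is a generic probability-based score (one minus the product of the token probabilities of a claim), and the names `claim_uncertainty` and `flag_claims`, the input format, and the threshold are all assumptions made for illustration.

```python
import math

def claim_uncertainty(token_probs):
    """Return 1 - product of the token probabilities belonging to one claim."""
    log_p = sum(math.log(p) for p in token_probs)  # sum logs for stability
    return 1.0 - math.exp(log_p)

def flag_claims(claims, threshold):
    """Return the texts of claims whose uncertainty exceeds the threshold.

    claims: list of (claim_text, token_probabilities) pairs.
    """
    return [text for text, probs in claims if claim_uncertainty(probs) > threshold]
```

For example, `flag_claims([("born in 1950", [0.9, 0.95, 0.9]), ("in Vienna", [0.8, 0.3])], 0.5)` flags only the second claim. A score of this kind mixes uncertainty about which claim to state and how to phrase it with uncertainty about the claim's value, the effect the abstract says CCP is designed to remove.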
A sememe is defined as the minimum semantic unit of human languages. Sememe knowledge bases (KBs), which contain words annotated with sememes, have been successfully applied to many NLP tasks. However, existing sememe KBs are built on only a few languages, which hinders their widespread utilization. To address the issue, we propose to build a unified sememe KB for multiple languages based on BabelNet, a multilingual encyclopedic dictionary. We first build a dataset serving as the seed of the multilingual sememe KB. It manually annotates sememes for over 15 thousand synsets (the entries of BabelNet). Then, we present a novel task of automatic sememe prediction for synsets, aiming to expand the seed dataset into a usable KB. We also propose two simple and effective models, which exploit different information about synsets. Finally, we conduct quantitative and qualitative analyses to explore important factors and difficulties in the task. All the source code and data of this work can be obtained at //github.com/thunlp/BabelNet-Sememe-Prediction.
Named entity recognition (NER) is the task of identifying text spans that mention named entities and classifying them into predefined categories such as person, location, organization, etc. NER serves as the basis for a variety of natural language applications such as question answering, text summarization, and machine translation. Although early NER systems are successful in producing decent recognition accuracy, they often require much human effort in carefully designing rules or features. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding state-of-the-art performance. In this paper, we provide a comprehensive review of existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recently applied deep learning techniques in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.