Given a binary word relation $\tau$ over $A^*$ and a finite language $X \subseteq A^*$, a $\tau$-Gray cycle over $X$ consists of a permutation $\left(w_{[i]}\right)_{0 \le i \le |X|-1}$ of $X$ such that each word $w_{[i]}$ is an image under $\tau$ of the previous word $w_{[i-1]}$. We define the complexity measure $\lambda_{A,\tau}(n)$, equal to the largest cardinality of a language $X$ having words of length at most $n$ and such that some $\tau$-Gray cycle over $X$ exists. The present paper is concerned with $\tau = \sigma_k$, the so-called $k$-character substitution, such that $(u, v) \in \sigma_k$ holds if, and only if, the Hamming distance of $u$ and $v$ is $k$. We present loopless (resp., constant amortized time) algorithms for computing specific maximum-length $\sigma_k$-Gray cycles.
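As a minimal illustration of the objects involved (not the paper's algorithms), the following Python sketch checks whether a proposed ordering of words forms a $\sigma_k$-Gray cycle, i.e., consecutive words (cyclically) have equal length and Hamming distance exactly $k$; the function names are ours.

    def hamming(u, v):
        # Hamming distance is only defined for words of equal length.
        return sum(a != b for a, b in zip(u, v))

    def is_sigma_k_gray_cycle(words, k):
        # words: a proposed ordering w[0], ..., w[|X|-1] of a finite language X.
        # Each word must be obtained from its predecessor (cyclically, so
        # words[-1] precedes words[0]) by substituting exactly k characters.
        if len(set(words)) != len(words):
            return False  # must be a permutation of X, no repeats
        return all(
            len(words[i - 1]) == len(words[i])
            and hamming(words[i - 1], words[i]) == k
            for i in range(len(words))
        )

    # Example: a sigma_1-Gray cycle over all binary words of length 2.
    print(is_sigma_k_gray_cycle(["00", "01", "11", "10"], 1))  # True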
In-context learning with large language models (LLMs) has recently attracted increasing attention due to its superior few-shot performance on various tasks. However, its performance on text-to-SQL parsing still has much room for improvement. In this paper, we hypothesize that a crucial aspect of LLMs to improve for text-to-SQL parsing is their multi-step reasoning ability. Thus, we systematically study how to enhance LLMs' reasoning ability through chain-of-thought (CoT) style prompting, including the original chain-of-thought prompting (Wei et al., 2022b) and least-to-most prompting (Zhou et al., 2023). Our experiments demonstrate that iterative prompting as in Zhou et al. (2023) may be unnecessary for text-to-SQL parsing and that using detailed reasoning steps tends to suffer from more error propagation. Based on these findings, we propose a new CoT-style prompting method for text-to-SQL parsing. It brings 5.2- and 6.5-point absolute gains on the Spider development set and the Spider Realistic set, respectively, over the standard prompting method without reasoning steps, and 2.4- and 1.5-point absolute gains over the least-to-most prompting method.
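A hedged sketch of what a CoT-style text-to-SQL prompt might look like; the exemplar, schema, question, and reasoning format below are invented for illustration and do not reproduce the paper's actual prompts.

    # Hypothetical one-shot prompt skeleton; model settings and the paper's
    # exemplars are not reproduced here.
    EXEMPLAR = """-- Schema: singer(singer_id, name, age)
    -- Question: How many singers are older than 30?
    -- Let's think step by step:
    -- 1. We need a count of rows in the table `singer`.
    -- 2. Only rows with age > 30 should be counted.
    SELECT count(*) FROM singer WHERE age > 30;
    """

    def build_prompt(schema: str, question: str) -> str:
        # Prepend the worked exemplar, then ask for the new query.
        return (
            EXEMPLAR
            + f"\n-- Schema: {schema}"
            + f"\n-- Question: {question}"
            + "\n-- Let's think step by step:"
        )

    print(build_prompt("concert(concert_id, venue, year)",
                       "How many concerts were held in 2014?"))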
Emotion recognition in text, the task of identifying emotions such as joy or anger, is a challenging problem in NLP with many applications. One of the challenges is the shortage of available datasets annotated with emotions. Certain existing datasets are small, follow different emotion taxonomies, and display imbalance in their emotion distribution. In this work, we studied the impact of data augmentation techniques particularly when applied to small imbalanced datasets, for which current state-of-the-art models (such as RoBERTa) under-perform. Specifically, we utilized four data augmentation methods (Easy Data Augmentation (EDA), static and contextual embedding-based augmentation, and ProtAugment) on three datasets that come from different sources and vary in size, emotion categories, and distributions. Our experimental results show that using the augmented data when training the classifier model leads to significant improvements. Finally, we conducted two case studies: a) directly using the popular ChatGPT API to paraphrase text using different prompts, and b) using external data to augment the training set. Results show the promising potential of these methods.
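To make the augmentation setup concrete, here is a minimal Python sketch of how a small, imbalanced training set could be expanded with paraphrases; `paraphrase` is our placeholder for any of the methods mentioned (EDA, embedding-based substitution, ProtAugment, or a ChatGPT-based paraphraser) and is not an API from the paper.

    import random
    from collections import Counter

    def paraphrase(text: str) -> str:
        # Placeholder: swap in EDA, embedding-based substitution, ProtAugment,
        # or a ChatGPT prompt such as "Paraphrase the following sentence: ...".
        # Returning the input unchanged keeps the sketch runnable.
        return text

    def augment_minority_classes(samples, target_per_class):
        # samples: list of (text, emotion_label) pairs.
        counts = Counter(label for _, label in samples)
        augmented = list(samples)
        for label, count in counts.items():
            pool = [text for text, lab in samples if lab == label]
            for _ in range(max(0, target_per_class - count)):
                augmented.append((paraphrase(random.choice(pool)), label))
        return augmented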
In a recent breakthrough, Chen, Hirahara, and Ren prove that $\mathsf{S_2E}/_1 \not\subset \mathsf{SIZE}[2^n/n]$ by giving a single-valued $\mathsf{FS_2P}$ algorithm for the Range Avoidance Problem ($\mathsf{Avoid}$) that works for infinitely many input lengths $n$. Building on their work, we present a simple single-valued $\mathsf{FS_2P}$ algorithm for $\mathsf{Avoid}$ that works for all input lengths $n$. As a result, we obtain the circuit lower bound $\mathsf{S_2E} \not\subset \mathsf{SIZE}[2^n/n]$ and many other corollaries: 1. Near-maximum circuit lower bounds for $\mathsf{\Sigma_2E} \cap \mathsf{\Pi_2E}$ and $\mathsf{ZPE}^{\mathsf{NP}}$. 2. Pseudodeterministic $\mathsf{FZPP}^{\mathsf{NP}}$ constructions for: Ramsey graphs, rigid matrices, pseudorandom generators, two-source extractors, linear codes, hard truth tables, and $K^{poly}$-random strings.
The emergence of WebAssembly allows attackers to hide the malicious functionalities of JavaScript malware in cross-language interoperations, termed JavaScript-WebAssembly multilingual malware (JWMM). However, existing anti-virus solutions based on static program analysis are still limited to monolingual code. As a result, their detection effectiveness decreases significantly against JWMM. The detection of JWMM is challenging due to the complex interoperations and semantic diversity between JavaScript and WebAssembly. To bridge this gap, we present JWBinder, the first technique aimed at enhancing the static detection of JWMM. JWBinder performs language-specific data-flow analysis to capture the cross-language interoperations and then characterizes the functionalities of JWMM through a unified high-level structure called the Inter-language Program Dependency Graph. An extensive evaluation on one of the most representative real-world anti-virus platforms, VirusTotal, shows that JWBinder effectively enhances anti-virus systems from various vendors and increases the overall successful detection rate against JWMM from 49.1\% to 86.2\%. Additionally, we assess the side effects and runtime overhead of JWBinder, corroborating its practical viability in real-world applications.
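As an illustration of the idea behind a unified cross-language dependency structure (not JWBinder's actual implementation), the sketch below merges a JavaScript data-flow graph and a WebAssembly data-flow graph and links them along interoperation call sites; the node names and edge attribute are invented.

    import networkx as nx

    def build_ipdg(js_dfg: nx.DiGraph, wasm_dfg: nx.DiGraph, interop_calls):
        # js_dfg / wasm_dfg: per-language data-flow graphs.
        # interop_calls: pairs (js_node, wasm_node) where a JS value flows into
        # an exported WebAssembly function (or vice versa).
        ipdg = nx.compose(js_dfg, wasm_dfg)  # union of both graphs
        for js_node, wasm_node in interop_calls:
            ipdg.add_edge(js_node, wasm_node, kind="cross-language")
        return ipdg

    # Toy example: a JS string flows into a Wasm decoding routine.
    js = nx.DiGraph([("js:payload", "js:WebAssembly.instantiate")])
    wasm = nx.DiGraph([("wasm:decode", "wasm:eval_sink")])
    ipdg = build_ipdg(js, wasm, [("js:WebAssembly.instantiate", "wasm:decode")])
    print(ipdg.number_of_nodes(), ipdg.number_of_edges())  # 4 3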
Adversarial contrastive learning (ACL) does not require expensive data annotations but outputs a robust representation that withstands adversarial attacks and also generalizes to a wide range of downstream tasks. However, ACL needs tremendous running time to generate the adversarial variants of all training data, which limits its scalability to large datasets. To speed up ACL, this paper proposes a robustness-aware coreset selection (RCS) method. RCS does not require label information and searches for an informative subset that minimizes a representational divergence, which is the distance of the representation between natural data and their virtual adversarial variants. The vanilla solution of RCS via traversing all possible subsets is computationally prohibitive. Therefore, we theoretically transform RCS into a surrogate problem of submodular maximization, of which the greedy search is an efficient solution with an optimality guarantee for the original problem. Empirically, our comprehensive results corroborate that RCS can speed up ACL by a large margin without significantly hurting the robustness transferability. Notably, to the best of our knowledge, we are the first to conduct ACL efficiently on the large-scale ImageNet-1K dataset to obtain an effective robust representation via RCS. Our source code is at //github.com/GodXuxilie/Efficient_ACL_via_RCS.
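A minimal, framework-agnostic sketch of the greedy selection idea described above, assuming a user-supplied gain function; this is our illustration, not the released implementation linked in the abstract.

    def greedy_coreset(batches, budget, divergence_gain):
        # batches: candidate data batches; budget: number of batches to keep.
        # divergence_gain(selected, candidate) should return how much adding
        # `candidate` to `selected` reduces the representational divergence
        # between natural data and their virtual adversarial variants.
        # Greedy maximization of the surrogate submodular objective enjoys the
        # usual (1 - 1/e)-style guarantee for that surrogate problem.
        selected = []
        remaining = list(batches)
        while remaining and len(selected) < budget:
            best = max(remaining, key=lambda b: divergence_gain(selected, b))
            selected.append(best)
            remaining.remove(best)
        return selected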
In this paper, we consider the counting function $E_P(y) = |P_{y} \cap Z^{n_x}|$ for a parametric polyhedron $P_{y} = \{x \in R^{n_x} \colon A x \leq b + B y\}$, where $y \in R^{n_y}$. We give a new representation of $E_P(y)$, called a \emph{piece-wise step-polynomial with periodic coefficients}, which is a generalization of piece-wise step-polynomials and of integer/rational Ehrhart quasi-polynomials. It gives the fastest way to calculate $E_P(y)$ in certain scenarios. The most important cases are the following: 1) We show that, for any $y \in Q^k$ and any parametric polyhedron $P_y$ defined by a standard-form system $A x = y,\, x \geq 0$ with a fixed number of equalities, $E_P(y)$ can be computed by a $poly\bigl(n, \|A\|_{\infty}\bigr)$-time algorithm; 2) Assuming that the co-dimension is fixed, we show that integer/rational Ehrhart quasi-polynomials of a polytope can be computed by FPT algorithms, parameterized by the sub-determinants of $A$ or its elements; 3) Our representation of $E_P$ is more efficient than other known approaches if $A$ has bounded elements, especially if it is additionally sparse. Additionally, we provide a discussion of possible applications in the area of compiler optimization. Under some ``natural'' assumptions on the program code, our approach has the fastest complexity bounds.
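A small worked example (ours, not taken from the paper) of a step-polynomial with a periodic coefficient: for the one-parameter polyhedron $P_y = \{x \in R \colon 0 \leq 2 x \leq y\}$ and $y \geq 0$, one has $E_P(y) = |P_y \cap Z| = \lfloor y/2 \rfloor + 1 = y/2 + \bigl(1 - \{y/2\}\bigr)$, where $\{\cdot\}$ denotes the fractional part. The degree-one coefficient is the constant $1/2$, while the constant term $1 - \{y/2\}$ is a periodic function of $y$ with period $2$.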
The emergence of large language models (LLMs) has substantially influenced natural language processing, demonstrating exceptional results across various tasks. In this study, we employ ``Introspective Tips'' to facilitate LLMs in self-optimizing their decision-making. By introspectively examining its trajectories, the LLM refines its policy by generating succinct and valuable tips. Our method enhances the agent's performance in both few-shot and zero-shot learning situations by considering three essential scenarios: learning from the agent's past experiences, integrating expert demonstrations, and generalizing across diverse games. Importantly, we accomplish these improvements without fine-tuning the LLM parameters; rather, we adjust the prompt to generalize insights from the three aforementioned situations. Our framework not only supports but also emphasizes the advantage of employing LLMs in in-context decision-making. Experiments involving over 100 games in TextWorld illustrate the superior performance of our approach.
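A hedged sketch of the prompt-level loop this describes; the function names, prompt wording, and tip format are ours, not the paper's, and `llm` stands for any text-completion callable.

    def summarize_trajectory_into_tips(llm, trajectory):
        # Ask the model to introspect on a past trajectory and distill short tips.
        prompt = (
            "You attempted a text game. Here is the trajectory:\n"
            f"{trajectory}\n"
            "Write at most three short tips that would improve your next attempt."
        )
        return llm(prompt)

    def act_with_tips(llm, observation, tips):
        # No fine-tuning: the tips are simply prepended to the decision prompt.
        prompt = (
            f"Tips from previous attempts:\n{tips}\n"
            f"Current observation:\n{observation}\n"
            "Choose the next action:"
        )
        return llm(prompt)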
Named entity recognition (NER) is the task of identifying text spans that mention named entities and classifying them into predefined categories such as person, location, or organization. NER serves as the basis for a variety of natural language applications such as question answering, text summarization, and machine translation. Although early NER systems were successful in producing decent recognition accuracy, they often required considerable human effort in carefully designing rules or features. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding state-of-the-art performance. In this paper, we provide a comprehensive review of existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recently applied deep learning techniques in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.
A sememe is defined as the minimum semantic unit of human languages. Sememe knowledge bases (KBs), which contain words annotated with sememes, have been successfully applied to many NLP tasks. However, existing sememe KBs are built for only a few languages, which hinders their widespread utilization. To address the issue, we propose to build a unified sememe KB for multiple languages based on BabelNet, a multilingual encyclopedic dictionary. We first build a dataset serving as the seed of the multilingual sememe KB, in which sememes are manually annotated for over $15$ thousand synsets (the entries of BabelNet). Then, we present a novel task of automatic sememe prediction for synsets, aiming to expand the seed dataset into a usable KB. We also propose two simple and effective models, which exploit different kinds of synset information. Finally, we conduct quantitative and qualitative analyses to explore important factors and difficulties in the task. All the source code and data of this work can be obtained at //github.com/thunlp/BabelNet-Sememe-Prediction.
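To make the prediction task concrete, here is a minimal embedding-based baseline of the kind the task admits (our illustration, not one of the paper's two models): score each candidate sememe against the average embedding of a synset's surface forms.

    import numpy as np

    def predict_sememes(synset_words, word_vecs, sememe_vecs, top_k=5):
        # synset_words: surface forms of a BabelNet synset (possibly multilingual).
        # word_vecs / sememe_vecs: dicts mapping words / sememes to vectors.
        vecs = [word_vecs[w] for w in synset_words if w in word_vecs]
        if not vecs:
            return []
        synset_vec = np.mean(vecs, axis=0)
        scores = {
            s: float(np.dot(v, synset_vec) /
                     (np.linalg.norm(v) * np.linalg.norm(synset_vec) + 1e-9))
            for s, v in sememe_vecs.items()
        }
        # Return the top-k sememes by cosine similarity.
        return sorted(scores, key=scores.get, reverse=True)[:top_k]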
Manually labeling objects by tracing their boundaries is a laborious process. Polygon-RNN and its successor Polygon-RNN++ produce polygonal annotations in a recurrent manner using a CNN-RNN architecture, allowing interactive correction via humans in the loop. We propose a new framework that alleviates the sequential nature of Polygon-RNN by predicting all vertices simultaneously using a Graph Convolutional Network (GCN). Our model, Curve-GCN, is trained end-to-end. It supports object annotation by either polygons or splines, facilitating labeling efficiency for both line-based and curved objects. We show that Curve-GCN outperforms all existing approaches in automatic mode, including the powerful PSP-DeepLab, and is significantly more efficient in interactive mode than Polygon-RNN++. Our model runs at 29.3 ms in automatic mode and 2.6 ms in interactive mode, making it 10x and 100x faster than Polygon-RNN++, respectively.
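A minimal sketch of the core operation this contrasts with a recurrent decoder: every polygon vertex is a node of a cyclic graph, and all vertex offsets are predicted in parallel by graph propagation; the layer structure, tensor shapes, and feature extractor below are placeholders, not the paper's architecture.

    import torch
    import torch.nn as nn

    class VertexGCNLayer(nn.Module):
        # One propagation step over a cycle graph of N polygon vertices:
        # each node mixes its own features with those of its two neighbours.
        def __init__(self, dim):
            super().__init__()
            self.self_fc = nn.Linear(dim, dim)
            self.neigh_fc = nn.Linear(dim, dim)

        def forward(self, feats):              # feats: (N, dim)
            neigh = torch.roll(feats, 1, dims=0) + torch.roll(feats, -1, dims=0)
            return torch.relu(self.self_fc(feats) + self.neigh_fc(neigh))

    # All vertex offsets are produced simultaneously (no recurrence):
    dim, n_vertices = 64, 40
    feats = torch.randn(n_vertices, dim)       # placeholder per-vertex CNN features
    layer, head = VertexGCNLayer(dim), nn.Linear(dim, 2)
    offsets = head(layer(feats))               # (N, 2) x/y offsets, predicted in parallel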