We develop a novel asymptotic theory for local polynomial (quasi-) maximum-likelihood estimators of time-varying parameters in a broad class of nonlinear time series models. Under weak regularity conditions, we show the proposed estimators are consistent and follow normal distributions in large samples. Our conditions impose weaker smoothness and moment conditions on the data-generating process and its likelihood compared to existing theories. Furthermore, the bias terms of the estimators take a simpler form. We demonstrate the usefulness of our general results by applying our theory to local (quasi-)maximum-likelihood estimators of a time-varying VAR's, ARCH and GARCH, and Poisson autogressions. For the first three models, we are able to substantially weaken the conditions found in the existing literature. For the Poisson autogression, existing theories cannot be be applied while our novel approach allows us to analyze it.
An executable binary typically contains a large number of machine instructions. Although the statistics of popular instructions is well known, the distribution of non-popular instructions has been relatively under explored. Our finding shows that an arbitrary group of binaries com es with both i) a similar distribution of common machine instructions, and ii) quite a few rarely appeared instructions (e.g., less than five occurrences) apart from the distribution. Their infrequency may represent the signature of a code chunk or the footprint of a binary. In this work, we investigate such rare instructions with an in-depth analysis at the source level, clas sifying them into four categories.
The study of persistence rests largely on the result that any finitely-indexed persistence module of finite-dimensional vector spaces admits an interval decomposition -- that is, a decomposition as a direct sum of interval modules. This result fails if we replace vector spaces with modules over more general coefficient rings. For example, not every persistence module of finitely-generated free abelian groups admits an interval decomposition. Nevertheless, many interesting examples of such persistence modules have been empirically observed to decompose into intervals. Due to the prevalence of these modules in applied and theoretical settings, it is important to understand the conditions under which interval decomposition is possible. We provide a necessary and sufficient condition, and a polynomial-time algorithm to either (a) compute an interval decomposition of a persistence module of free abelian groups, or (b) certify that no such decomposition exists. This complements earlier work, which characterizes filtered topological spaces whose persistence diagrams are independent of the choice of ground field.
Many evolutionary algorithms (EAs) take advantage of parallel evaluation of candidates. However, if evaluation times vary significantly, many worker nodes (i.e.,\ compute clients) are idle much of the time, waiting for the next generation to be created. Evolutionary neural architecture search (ENAS), a class of EAs that optimizes the architecture and hyperparameters of deep neural networks, is particularly vulnerable to this issue. This paper proposes a generic asynchronous evaluation strategy (AES) that is then adapted to work with ENAS. AES increases throughput by maintaining a queue of up to $K$ individuals ready to be sent to the workers for evaluation and proceeding to the next generation as soon as $M<<K$ individuals have been evaluated. A suitable value for $M$ is determined experimentally, balancing diversity and efficiency. To showcase the generality and power of AES, it was first evaluated in eight-line sorting network design (a single-population optimization task with limited evaluation-time variability), achieving an over two-fold speedup. Next, it was evaluated in 11-bit multiplexer design (a single-population discovery task with extended variability), where a 14-fold speedup was observed. It was then scaled up to ENAS for image captioning (a multi-population open-ended-optimization task), resulting in an over two-fold speedup. In all problems, a multifold performance improvement was observed, suggesting that AES is a promising method for parallelizing the evolution of complex systems with long and variable evaluation times, such as those in ENAS.
We develop a class of interacting particle systems for implementing a maximum marginal likelihood estimation (MMLE) procedure to estimate the parameters of a latent variable model. We achieve this by formulating a continuous-time interacting particle system which can be seen as a Langevin diffusion over an extended state space of parameters and latent variables. In particular, we prove that the parameter marginal of the stationary measure of this diffusion has the form of a Gibbs measure where number of particles acts as the inverse temperature parameter in classical settings for global optimisation. Using a particular rescaling, we then prove geometric ergodicity of this system and bound the discretisation error in a manner that is uniform in time and does not increase with the number of particles. The discretisation results in an algorithm, termed Interacting Particle Langevin Algorithm (IPLA) which can be used for MMLE. We further prove nonasymptotic bounds for the optimisation error of our estimator in terms of key parameters of the problem, and also extend this result to the case of stochastic gradients covering practical scenarios. We provide numerical experiments to illustrate the empirical behaviour of our algorithm in the context of logistic regression with verifiable assumptions. Our setting provides a straightforward way to implement a diffusion-based optimisation routine compared to more classical approaches such as the Expectation Maximisation (EM) algorithm, and allows for especially explicit nonasymptotic bounds.
An important aspect in developing language models that interact with humans is aligning their behavior to be useful and unharmful for their human users. This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones, a process referred to as alignment. In this paper, we propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models. Importantly, we prove that within the limits of this framework, for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt. This implies that any alignment process that attenuates an undesired behavior but does not remove it altogether, is not safe against adversarial prompting attacks. Furthermore, our framework hints at the mechanism by which leading alignment approaches such as reinforcement learning from human feedback make the LLM prone to being prompted into the undesired behaviors. This theoretical result is being experimentally demonstrated in large scale by the so called contemporary "chatGPT jailbreaks", where adversarial users trick the LLM into breaking its alignment guardrails by triggering it into acting as a malicious persona. Our results expose fundamental limitations in alignment of LLMs and bring to the forefront the need to devise reliable mechanisms for ensuring AI safety.
Deep Language Models (DLMs) provide a novel computational paradigm for understanding the mechanisms of natural language processing in the human brain. Unlike traditional psycholinguistic models, DLMs use layered sequences of continuous numerical vectors to represent words and context, allowing a plethora of emerging applications such as human-like text generation. In this paper we show evidence that the layered hierarchy of DLMs may be used to model the temporal dynamics of language comprehension in the brain by demonstrating a strong correlation between DLM layer depth and the time at which layers are most predictive of the human brain. Our ability to temporally resolve individual layers benefits from our use of electrocorticography (ECoG) data, which has a much higher temporal resolution than noninvasive methods like fMRI. Using ECoG, we record neural activity from participants listening to a 30-minute narrative while also feeding the same narrative to a high-performing DLM (GPT2-XL). We then extract contextual embeddings from the different layers of the DLM and use linear encoding models to predict neural activity. We first focus on the Inferior Frontal Gyrus (IFG, or Broca's area) and then extend our model to track the increasing temporal receptive window along the linguistic processing hierarchy from auditory to syntactic and semantic areas. Our results reveal a connection between human language processing and DLMs, with the DLM's layer-by-layer accumulation of contextual information mirroring the timing of neural activity in high-order language areas.
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.
Named entity recognition (NER) is the task to identify text spans that mention named entities, and to classify them into predefined categories such as person, location, organization etc. NER serves as the basis for a variety of natural language applications such as question answering, text summarization, and machine translation. Although early NER systems are successful in producing decent recognition accuracy, they often require much human effort in carefully designing rules or features. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding stat-of-the-art performance. In this paper, we provide a comprehensive review on existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recent applied techniques of deep learning in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
We propose a novel approach to multimodal sentiment analysis using deep neural networks combining visual analysis and natural language processing. Our goal is different than the standard sentiment analysis goal of predicting whether a sentence expresses positive or negative sentiment; instead, we aim to infer the latent emotional state of the user. Thus, we focus on predicting the emotion word tags attached by users to their Tumblr posts, treating these as "self-reported emotions." We demonstrate that our multimodal model combining both text and image features outperforms separate models based solely on either images or text. Our model's results are interpretable, automatically yielding sensible word lists associated with emotions. We explore the structure of emotions implied by our model and compare it to what has been posited in the psychology literature, and validate our model on a set of images that have been used in psychology studies. Finally, our work also provides a useful tool for the growing academic study of images - both photographs and memes - on social networks.