In his seminal paper "Computing Machinery and Intelligence", Alan Turing introduced the "imitation game" as part of exploring the concept of machine intelligence. The Turing Test has since been the subject of much analysis, debate, refinement and extension. Here we sidestep the question of whether a particular machine can be labeled intelligent, or can be said to match human capabilities in a given context. Instead, but inspired by Turing, we draw attention to the seemingly simpler challenge of determining whether one is interacting with a human or with a machine, in the context of everyday life. We are interested in reflecting upon the importance of this Human-or-Machine question and the use one may make of a reliable answer thereto. Whereas Turing's original test is widely considered to be more of a thought experiment, the Human-or-Machine question as discussed here has obvious practical significance. And while the jury is still not in regarding the possibility of machines that can mimic human behavior with high fidelity in everyday contexts, we argue that near-term exploration of the issues raised here can contribute to development methods for computerized systems, and may also improve our understanding of human behavior in general.
Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging. Quantization is a promising approach to reducing model complexity, and the dyadic arithmetic pipeline can allow the quantized models to perform efficient integer-only inference. Unfortunately, dyadic arithmetic is based on the homogeneity condition in convolutional neural networks, which is not applicable to the non-linear components in ViTs, making integer-only inference of ViTs an open issue. In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs, to enable ViTs to perform the entire computational graph of inference with integer arithmetic and bit-shifting, and without any floating-point arithmetic. In I-ViT, linear operations (e.g., MatMul and Dense) follow the integer-only pipeline with dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and LayerNorm) are approximated by the proposed light-weight integer-only arithmetic methods. More specifically, I-ViT applies the proposed Shiftmax and ShiftGELU, which are designed to use integer bit-shifting to approximate the corresponding floating-point operations. We evaluate I-ViT on various benchmark models and the results show that integer-only INT8 quantization achieves comparable (or even slightly higher) accuracy to the full-precision (FP) baseline. Furthermore, we utilize TVM for practical hardware deployment on the GPU's integer arithmetic units, achieving 3.72$\sim$4.11$\times$ inference speedup compared to the FP model. Code of both Pytorch and TVM is released at //github.com/zkkli/I-ViT.
This paper explores the Achievable Information Rate (AIR) of a diffusive Molecular Communication (MC) channel featuring a fully absorbing receiver that counts the absorbed particles during symbol time intervals (STIs) and resets the counter at the start of each interval. The MC channel, influenced by memory effect, experiences inter-symbol interference (ISI) arising from the molecules' delayed arrival. The channel's memory is quantified as an integer multiple of the STI and a single-sample memoryless detector is employed to mitigate complexity in computing the mutual information (MI). To maximize MI, the detector threshold is optimized under Gaussian approximation of its input. The channel's MI is calculated, considering the influence of ISI, in the context of binary concentration shift keying modulation. Two distinct scenarios were considered; independent and correlated source-generated symbols, the latter modeled as a first-order Markov process. For each communication scenario, two degrees of knowledge: ISI-Aware and ISI-Unaware were considered. Remarkably, it is demonstrated that employing a correlated source enables the attainment of higher capacity. The results indicate that the capacity-achieving input distribution is not necessarily uniform. Notably, when the STI is small, corresponding to the case of strong ISI, the maximum AIR is not achieved through equiprobable symbol transmission.
Achieving precise, highly-dynamic maneuvers with Unmanned Aerial Vehicles (UAVs) is a major challenge due to the complexity of the associated aerodynamics. In particular, unsteady effects -- as might be experienced in post-stall regimes or during sudden vehicle morphing -- can have an adverse impact on the performance of modern flight control systems. In this paper, we present a vortex particle model and associated model-based controller capable of reasoning about the unsteady aerodynamics during aggressive maneuvers. We evaluate our approach in hardware on a morphing-wing UAV executing post-stall perching maneuvers. Our results show that the use of the unsteady aerodynamics model improves performance during both fixed-wing and dynamic-wing perching, while the use of wing-morphing planned with quasi-steady aerodynamics results in reduced performance. While the focus of this paper is a pre-computed control policy, we believe that, with sufficient computational resources, our approach could enable online planning in the future.
Motivated by recent applications of sequential decision making in matching markets, in this paper we attempt at formulating and abstracting market designs for P2P lending. We describe a paradigm to set the stage for how peer to peer investments can be conceived from a matching market perspective, especially when both borrower and lender preferences are respected. We model these specialized markets as an optimization problem and consider different utilities for agents on both sides of the market while also understanding the impact of equitable allocations to borrowers. We devise a technique based on sequential decision making that allow the lenders to adjust their choices based on the dynamics of uncertainty from competition over time and that also impacts the rewards in return for their investments. Using simulated experiments we show the dynamics of the regret based on the optimal borrower-lender matching and find that the lender regret depends on the initial preferences set by the lenders which could affect their learning over decision making steps.
Knowledge plays a critical role in artificial intelligence. Recently, the extensive success of pre-trained language models (PLMs) has raised significant attention about how knowledge can be acquired, maintained, updated and used by language models. Despite the enormous amount of related studies, there still lacks a unified view of how knowledge circulates within language models throughout the learning, tuning, and application processes, which may prevent us from further understanding the connections between current progress or realizing existing limitations. In this survey, we revisit PLMs as knowledge-based systems by dividing the life circle of knowledge in PLMs into five critical periods, and investigating how knowledge circulates when it is built, maintained and used. To this end, we systematically review existing studies of each period of the knowledge life cycle, summarize the main challenges and current limitations, and discuss future directions.
We introduce DeepNash, an autonomous agent capable of learning to play the imperfect information game Stratego from scratch, up to a human expert level. Stratego is one of the few iconic board games that Artificial Intelligence (AI) has not yet mastered. This popular game has an enormous game tree on the order of $10^{535}$ nodes, i.e., $10^{175}$ times larger than that of Go. It has the additional complexity of requiring decision-making under imperfect information, similar to Texas hold'em poker, which has a significantly smaller game tree (on the order of $10^{164}$ nodes). Decisions in Stratego are made over a large number of discrete actions with no obvious link between action and outcome. Episodes are long, with often hundreds of moves before a player wins, and situations in Stratego can not easily be broken down into manageably-sized sub-problems as in poker. For these reasons, Stratego has been a grand challenge for the field of AI for decades, and existing AI methods barely reach an amateur level of play. DeepNash uses a game-theoretic, model-free deep reinforcement learning method, without search, that learns to master Stratego via self-play. The Regularised Nash Dynamics (R-NaD) algorithm, a key component of DeepNash, converges to an approximate Nash equilibrium, instead of 'cycling' around it, by directly modifying the underlying multi-agent learning dynamics. DeepNash beats existing state-of-the-art AI methods in Stratego and achieved a yearly (2022) and all-time top-3 rank on the Gravon games platform, competing with human expert players.
Pre-trained Language Models (PLMs) have achieved great success in various Natural Language Processing (NLP) tasks under the pre-training and fine-tuning paradigm. With large quantities of parameters, PLMs are computation-intensive and resource-hungry. Hence, model pruning has been introduced to compress large-scale PLMs. However, most prior approaches only consider task-specific knowledge towards downstream tasks, but ignore the essential task-agnostic knowledge during pruning, which may cause catastrophic forgetting problem and lead to poor generalization ability. To maintain both task-agnostic and task-specific knowledge in our pruned model, we propose ContrAstive Pruning (CAP) under the paradigm of pre-training and fine-tuning. It is designed as a general framework, compatible with both structured and unstructured pruning. Unified in contrastive learning, CAP enables the pruned model to learn from the pre-trained model for task-agnostic knowledge, and fine-tuned model for task-specific knowledge. Besides, to better retain the performance of the pruned model, the snapshots (i.e., the intermediate models at each pruning iteration) also serve as effective supervisions for pruning. Our extensive experiments show that adopting CAP consistently yields significant improvements, especially in extremely high sparsity scenarios. With only 3% model parameters reserved (i.e., 97% sparsity), CAP successfully achieves 99.2% and 96.3% of the original BERT performance in QQP and MNLI tasks. In addition, our probing experiments demonstrate that the model pruned by CAP tends to achieve better generalization ability.
Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, in the last few years, a large research effort has been devoted to image captioning, i.e. the task of describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoding step and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, and relationships and the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results obtained, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview and categorization of image captioning approaches, from visual encoding and text generation to training strategies, used datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in image captioning architectures and training strategies. Moreover, many variants of the problem and its open challenges are analyzed and discussed. The final goal of this work is to serve as a tool for understanding the existing state-of-the-art and highlighting the future directions for an area of research where Computer Vision and Natural Language Processing can find an optimal synergy.
The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has, without exaggeration, revolutionized the fields of natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage ranking architectures and learned dense representations that attempt to perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond the typical sentence-by-sentence processing approaches used in NLP, and techniques for addressing the tradeoff between effectiveness (result quality) and efficiency (query latency). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.
We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT based pre-training technique for NLP and image-language tasks, VideoBERT and CBT are proposed to exploit BERT model for video and language pre-training using narrated instructional videos. Different from their works which only pre-train understanding task, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises of 4 components including two single-modal encoders, a cross encoder and a decoder with the Transformer backbone. We first pre-train our model to learn the universal representation for both video and language on a large instructional video dataset. Then we fine-tune the model on two multimodal tasks including understanding task (text-based video retrieval) and generation task (multimodal video captioning). Our extensive experiments show that our method can improve the performance of both understanding and generation tasks and achieves the state-of-the art results.