This paper proposes a deep-learning-based approach to writer retrieval and identification for papyri, with a focus on identifying fragments associated with a specific writer and those corresponding to the same image. We present a novel neural network architecture that combines a residual backbone with a feature mixing stage to improve retrieval performance, and the final descriptor is derived from a projection layer. The methodology is evaluated on two benchmarks: PapyRow, where we achieve a mAP of 26.6 % and 24.9 % on writer and page retrieval, and HisFragIR20, showing state-of-the-art performance (44.0 % and 29.3 % mAP). Furthermore, our network has an accuracy of 28.7 % for writer identification. Additionally, we conduct experiments on the influence of two binarization techniques on fragments and show that binarizing does not enhance performance. Our code and models are available to the community.
This paper is a call to action for research and discussion on data visualization education. As visualization evolves and spreads through our professional and personal lives, we need to understand how to support and empower a broad and diverse community of learners in visualization. Data Visualization is a diverse and dynamic discipline that combines knowledge from different fields, is tailored to suit diverse audiences and contexts, and frequently incorporates tacit knowledge. This complex nature leads to a series of interrelated challenges for data visualization education. Driven by a lack of consolidated knowledge, overview, and orientation for visualization education, the 21 authors of this paper-educators and researchers in data visualization-identify and describe 19 challenges informed by our collective practical experience. We organize these challenges around seven themes People, Goals & Assessment, Environment, Motivation, Methods, Materials, and Change. Across these themes, we formulate 43 research questions to address these challenges. As part of our call to action, we then conclude with 5 cross-cutting opportunities and respective action items: embrace DIVERSITY+INCLUSION, build COMMUNITIES, conduct RESEARCH, act AGILE, and relish RESPONSIBILITY. We aim to inspire researchers, educators and learners to drive visualization education forward and discuss why, how, who and where we educate, as we learn to use visualization to address challenges across many scales and many domains in a rapidly changing world: viseducationchallenges.github.io.
Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall.
This paper proposes a novel unsupervised domain adaption (UDA) method based on contrastive bi-projector (CBP), which can improve the existing UDA methods. It is called CBPUDA here, which effectively promotes the feature extractors (FEs) to reduce the generation of ambiguous features for classification and domain adaption. The CBP differs from traditional bi-classifier-based methods at that these two classifiers are replaced with two projectors of performing a mapping from the input feature to two distinct features. These two projectors and the FEs in the CBPUDA can be trained adversarially to obtain more refined decision boundaries so that it can possess powerful classification performance. Two properties of the proposed loss function are analyzed here. The first property is to derive an upper bound of joint prediction entropy, which is used to form the proposed loss function, contrastive discrepancy (CD) loss. The CD loss takes the advantages of the contrastive learning and the bi-classifier. The second property is to analyze the gradient of the CD loss and then overcome the drawback of the CD loss. The result of the second property is utilized in the development of the gradient scaling (GS) scheme in this paper. The GS scheme can be exploited to tackle the unstable problem of the CD loss because training the CBPUDA requires using contrastive learning and adversarial learning at the same time. Therefore, using the CD loss with the GS scheme overcomes the problem mentioned above to make features more compact for intra-class and distinguishable for inter-class. Experimental results express that the CBPUDA is superior to conventional UDA methods under consideration in this paper for UDA and fine-grained UDA tasks.
This paper presents a modular approach to motion planning with provable stability guarantees for robots that move through changing environments via periodic locomotion behaviors. We focus on dynamic walkers as a paradigm for such systems, although the tools developed in this paper can be used to support general compositional approaches to robot motion planning with Dynamic Movement Primitives (DMPs). Our approach ensures a priori that the suggested plan can be stably executed. This is achieved by formulating the planning process as a Switching System with Multiple Equilibria (SSME) and proving that the system's evolution remains within explicitly characterized trapping regions in the state space under suitable constraints on the frequency of switching among the DMPs. These conditions effectively encapsulate the low-level stability limitations in a form that can be easily communicated to the planner to guarantee that the suggested plan is compatible with the robot's dynamics. Furthermore, we show how the available primitives can be safely composed online in a receding horizon manner to enable the robot to react to moving obstacles. The proposed framework is applied on 3D bipedal walking models under common modeling assumptions, and offers a modular approach towards stably integrating readily available low-level locomotion control and high-level planning methods.
This paper studies a novel movable antenna (MA)-enhanced multiple-input multiple-output (MIMO) system to leverage the corresponding spatial degrees of freedom (DoFs) for improving the performance of wireless communications. We aim to maximize the achievable rate by jointly optimizing the MA positions and the transmit covariance matrix based on statistical channel state information (CSI). To solve the resulting design problem, we develop a constrained stochastic successive convex approximation (CSSCA) algorithm applicable for the general movement mode. Furthermore, we propose two simplified antenna movement modes, namely the linear movement mode and the planar movement mode, to facilitate efficient antenna movement and reduce the computational complexity of the CSSCA algorithm. Numerical results show that the considered MA-enhanced system can significantly improve the achievable rate compared to conventional MIMO systems employing uniform planar arrays (UPAs) and that the proposed planar movement mode performs closely to the performance upper bound achieved by the general movement mode.
Aiming at expanding few-shot relations' coverage in knowledge graphs (KGs), few-shot knowledge graph completion (FKGC) has recently gained more research interests. Some existing models employ a few-shot relation's multi-hop neighbor information to enhance its semantic representation. However, noise neighbor information might be amplified when the neighborhood is excessively sparse and no neighbor is available to represent the few-shot relation. Moreover, modeling and inferring complex relations of one-to-many (1-N), many-to-one (N-1), and many-to-many (N-N) by previous knowledge graph completion approaches requires high model complexity and a large amount of training instances. Thus, inferring complex relations in the few-shot scenario is difficult for FKGC models due to limited training instances. In this paper, we propose a few-shot relational learning with global-local framework to address the above issues. At the global stage, a novel gated and attentive neighbor aggregator is built for accurately integrating the semantics of a few-shot relation's neighborhood, which helps filtering the noise neighbors even if a KG contains extremely sparse neighborhoods. For the local stage, a meta-learning based TransH (MTransH) method is designed to model complex relations and train our model in a few-shot learning fashion. Extensive experiments show that our model outperforms the state-of-the-art FKGC approaches on the frequently-used benchmark datasets NELL-One and Wiki-One. Compared with the strong baseline model MetaR, our model achieves 5-shot FKGC performance improvements of 8.0% on NELL-One and 2.8% on Wiki-One by the metric Hits@10.
In this paper, we propose a novel Feature Decomposition and Reconstruction Learning (FDRL) method for effective facial expression recognition. We view the expression information as the combination of the shared information (expression similarities) across different expressions and the unique information (expression-specific variations) for each expression. More specifically, FDRL mainly consists of two crucial networks: a Feature Decomposition Network (FDN) and a Feature Reconstruction Network (FRN). In particular, FDN first decomposes the basic features extracted from a backbone network into a set of facial action-aware latent features to model expression similarities. Then, FRN captures the intra-feature and inter-feature relationships for latent features to characterize expression-specific variations, and reconstructs the expression feature. To this end, two modules including an intra-feature relation modeling module and an inter-feature relation modeling module are developed in FRN. Experimental results on both the in-the-lab databases (including CK+, MMI, and Oulu-CASIA) and the in-the-wild databases (including RAF-DB and SFEW) show that the proposed FDRL method consistently achieves higher recognition accuracy than several state-of-the-art methods. This clearly highlights the benefit of feature decomposition and reconstruction for classifying expressions.
Named entity recognition (NER) is the task to identify text spans that mention named entities, and to classify them into predefined categories such as person, location, organization etc. NER serves as the basis for a variety of natural language applications such as question answering, text summarization, and machine translation. Although early NER systems are successful in producing decent recognition accuracy, they often require much human effort in carefully designing rules or features. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding stat-of-the-art performance. In this paper, we provide a comprehensive review on existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recent applied techniques of deep learning in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.
We introduce a multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks and develop a unified framework called Scientific Information Extractor (SciIE) for with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.
In this paper, we propose the joint learning attention and recurrent neural network (RNN) models for multi-label classification. While approaches based on the use of either model exist (e.g., for the task of image captioning), training such existing network architectures typically require pre-defined label sequences. For multi-label classification, it would be desirable to have a robust inference process, so that the prediction error would not propagate and thus affect the performance. Our proposed model uniquely integrates attention and Long Short Term Memory (LSTM) models, which not only addresses the above problem but also allows one to identify visual objects of interests with varying sizes without the prior knowledge of particular label ordering. More importantly, label co-occurrence information can be jointly exploited by our LSTM model. Finally, by advancing the technique of beam search, prediction of multiple labels can be efficiently achieved by our proposed network model.