Generative diffusion models, including Stable Diffusion and Midjourney, can generate visually appealing, diverse, and high-resolution images for various applications. These models are trained on billions of internet-sourced images, raising significant concerns about the potential unauthorized use of copyright-protected images. In this paper, we examine whether it is possible to determine if a specific image was used in the training set, a problem known in the cybersecurity community and referred to as a membership inference attack. Our focus is on Stable Diffusion, and we address the challenge of designing a fair evaluation framework to answer this membership question. We propose a methodology to establish a fair evaluation setup and apply it to Stable Diffusion, enabling potential extensions to other generative models. Utilizing this evaluation setup, we execute membership attacks (both known and newly introduced). Our research reveals that previously proposed evaluation setups do not provide a full understanding of the effectiveness of membership inference attacks. We conclude that the membership inference attack remains a significant challenge for large diffusion models (often deployed as black-box systems), indicating that related privacy and copyright issues will persist in the foreseeable future.
Ever-larger language models with ever-increasing capabilities are by now well-established text processing tools. Alas, information extraction tasks such as named entity recognition are still largely unaffected by this progress as they are primarily based on the previous generation of encoder-only transformer models. Here, we propose a simple yet effective approach, Informed Named Entity Recognition Decoding (iNERD), which treats named entity recognition as a generative process. It leverages the language understanding capabilities of recent generative models in a future-proof manner and employs an informed decoding scheme incorporating the restricted nature of information extraction into open-ended text generation, improving performance and eliminating any risk of hallucinations. We coarse-tune our model on a merged named entity corpus to strengthen its performance, evaluate five generative language models on eight named entity recognition datasets, and achieve remarkable results, especially in an environment with an unknown entity class set, demonstrating the adaptability of the approach.
By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual biasing method based on Context-Aware Transformer Transducer (CATT) that utilizes the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. Such prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios. Experiments on Librispeech and internal voice assistant datasets show that our approach can achieve up to 6.7% and 20.7% relative reduction in WER and CER compared to the baseline respectively, mitigating up to 96.7% and 84.9% of the relative WER and CER increase for common cases. Furthermore, our approach has a minimal performance impact in personalized scenarios while maintaining a streaming inference pipeline with negligible RTF increase.
As surgical interventions trend towards minimally invasive approaches, Concentric Tube Robots (CTRs) have been explored for various interventions such as brain, eye, fetoscopic, lung, cardiac and prostate surgeries. Arranged concentrically, each tube is rotated and translated independently to move the robot end-effector position, making kinematics and control challenging. Classical model-based approaches have been previously investigated with developments in deep learning based approaches outperforming more classical approaches in both forward kinematics and shape estimation. We propose a deep reinforcement learning approach to control where we generalise across two to four systems, an element not yet achieved in any other deep learning approach for CTRs. In this way we explore the likely robustness of the control approach. Also investigated is the impact of rotational constraints applied on tube actuation and the effects on error metrics. We evaluate inverse kinematics errors and tracking error for path following tasks and compare the results to those achieved using state of the art methods. Additionally, as current results are performed in simulation, we also investigate a domain transfer approach known as domain randomization and evaluate error metrics as an initial step towards hardware implementation. Finally, we compare our method to a Jacobian approach found in literature.
Language models that are sensitive to external context can more effectively capture the speaking patterns of individuals with specific characteristics or in particular environments. However, obtaining and leveraging such annotations can be challenging. In this work, we show how to leverage rich character and film annotations to personalise language models in a scalable manner. Our best model can reduce perplexity by up to 6.5% compared to a parameter-matched language model. Our approach performs on par with speaker-specific fine-tuning when the fine-tuning data (i.e. past dialogue) for individual speakers is available. On top of that, it also generalises well to a scenario with no such data, relying on combinations of demographic characteristics expressed via metadata. Our findings are consistent across two corpora, one of which is also a contribution of this paper: Cornell-rich contains rich manual annotations for 863 speaking characters from the Cornell Movie Dialog Corpus, including features such as characteristic quotes and character descriptions, along with six automatically extracted metadata features for over 95% of the featured films. Finally, we also present a cost-benefit analysis highlighting which annotations are most cost-effective in reducing perplexity.
Sentence embeddings enable us to capture the semantic similarity of short texts. Most sentence embedding models are trained for general semantic textual similarity (STS) tasks. Therefore, to use sentence embeddings in a particular domain, the model must be adapted to it in order to achieve good results. Usually, this is done by fine-tuning the entire sentence embedding model for the domain of interest. While this approach yields state-of-the-art results, all of the model's weights are updated during fine-tuning, making this method resource-intensive. Therefore, instead of fine-tuning entire sentence embedding models for each target domain individually, we propose to train lightweight adapters. These domain-specific adapters do not require fine-tuning all underlying sentence embedding model parameters. Instead, we only train a small number of additional parameters while keeping the weights of the underlying sentence embedding model fixed. Training domain-specific adapters allows always using the same base model and only exchanging the domain-specific adapters to adapt sentence embeddings to a specific domain. We show that using adapters for parameter-efficient domain adaptation of sentence embeddings yields competitive performance within 1% of a domain-adapted, entirely fine-tuned sentence embedding model while only training approximately 3.6% of the parameters.
Interaction with virtual objects displayed in Optical See-through Augmented Reality is still mostly done with controllers or hand gestures. A much more intuitive way of interacting with virtual content is to use physical proxy objects to interact with the virtual objects. Here, the virtual model is superimposed on a physical object, which can then be touched and moved to interact with the virtual object. Since it is not possible to use an exact replica as a tangible proxy object for every use case, we conducted a study to determine the extent to which the shape of the physical object can deviate from the shape of the virtual object without massively impacting performance and usability, as well as the sense of presence. Our study, in which we investigated different levels of abstraction for a sofa model, shows that the physical proxy object can be abstracted to a certain degree. At the same time, our results indicate that the physical object must have at least a similar shape as the virtual object in order to serve as a suitable proxy.
Graph Neural Networks (GNNs) have recently been used for node and graph classification tasks with great success, but GNNs model dependencies among the attributes of nearby neighboring nodes rather than dependencies among observed node labels. In this work, we consider the task of inductive node classification using GNNs in supervised and semi-supervised settings, with the goal of incorporating label dependencies. Because current GNNs are not universal (i.e., most-expressive) graph representations, we propose a general collective learning approach to increase the representation power of any existing GNN. Our framework combines ideas from collective classification with self-supervised learning, and uses a Monte Carlo approach to sampling embeddings for inductive learning across graphs. We evaluate performance on five real-world network datasets and demonstrate consistent, significant improvement in node classification accuracy, for a variety of state-of-the-art GNNs.
Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this survey, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art graph neural networks into different categories. With a focus on graph convolutional networks, we review alternative architectures that have recently been developed; these learning paradigms include graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal networks. We further discuss the applications of graph neural networks across various domains and summarize the open source codes and benchmarks of the existing algorithms on different learning tasks. Finally, we propose potential research directions in this fast-growing field.
We investigate a lattice-structured LSTM model for Chinese NER, which encodes a sequence of input characters as well as all potential words that match a lexicon. Compared with character-based methods, our model explicitly leverages word and word sequence information. Compared with word-based methods, lattice LSTM does not suffer from segmentation errors. Gated recurrent cells allow our model to choose the most relevant characters and words from a sentence for better NER results. Experiments on various datasets show that lattice LSTM outperforms both word-based and character-based LSTM baselines, achieving the best results.
Providing model-generated explanations in recommender systems is important to user experience. State-of-the-art recommendation algorithms -- especially the collaborative filtering (CF) based approaches with shallow or deep models -- usually work with various unstructured information sources for recommendation, such as textual reviews, visual images, and various implicit or explicit feedbacks. Though structured knowledge bases were considered in content-based approaches, they have been largely ignored recently due to the availability of vast amount of data and the learning power of many complex models. However, structured knowledge bases exhibit unique advantages in personalized recommendation systems. When the explicit knowledge about users and items is considered for recommendation, the system could provide highly customized recommendations based on users' historical behaviors and the knowledge is helpful for providing informed explanations regarding the recommended items. In this work, we propose to reason over knowledge base embeddings for explainable recommendation. Specifically, we propose a knowledge base representation learning framework to embed heterogeneous entities for recommendation, and based on the embedded knowledge base, a soft matching algorithm is proposed to generate personalized explanations for the recommended items. Experimental results on real-world e-commerce datasets verified the superior recommendation performance and the explainability power of our approach compared with state-of-the-art baselines.