RNA design aims to find a sequence that folds with highest probability into a designated target structure. However, certain structures are undesignable, meaning no sequence can fold into the target structure under the default (Turner) RNA folding model. Understanding the specific local structures (i.e., "motifs") that contribute to undesignability is crucial for refining RNA folding models and determining the limits of RNA designability. Despite its importance, this problem has received very little attention, and previous efforts are neither scalable nor interpretable. We develop a new theoretical framework for motif (un-)designability, and design scalable and interpretable algorithms to identify minimal undesignable motifs within a given RNA secondary structure. Our approach establishes motif undesignability by searching for rival motifs, rather than exhaustively enumerating all (partial) sequences that could potentially fold into the motif. Furthermore, we exploit rotational invariance in RNA structures to detect, group, and reuse equivalent motifs and to construct a database of unique minimal undesignable motifs. To achieve that, we propose a loop-pair graph representation for motifs and a recursive graph isomorphism algorithm for motif equivalence. Our algorithms successfully identify 24 unique minimal undesignable motifs among 18 undesignable puzzles from the Eterna100 benchmark. Surprisingly, we also find over 350 unique minimal undesignable motifs and 663 undesignable native structures in the ArchiveII dataset, drawn from a diverse set of RNA families. Our source code is available at //github.com/shanry/RNA-Undesign and our web server is available at //linearfold.org/motifs.
We prove that the celebrated Planar Product Structure Theorem by Dujmovic et al, and also related graph product structure results, can be formulated with the induced subgraph containment relation. Precisely, we prove that if a graph G is a subgraph of the strong product of a graph Q of bounded maximum degree (such as a path) and a graph M of bounded tree-width, then G is an induced subgraph of the strong product of Q and a graph M' of bounded tree-width being at most exponential in the maximum degree of Q and the tree-width of M. In particular, if G is planar, we show that G is an induced subgraph of the strong product of a path and a graph of tree-width 39. In the course of proving this result, we introduce and study H-clique-width, a new single structural measure that captures a hereditary analogue of the traditional product structure (where, informally, the strong product has one factor from the graph class H and one factor of bounded clique-width).
Language models (LMs) are capable of acquiring elements of human-like syntactic knowledge. Targeted syntactic evaluation tests have been employed to measure how well they form generalizations about syntactic phenomena in high-resource languages such as English. However, we still lack a thorough understanding of LMs' capacity for syntactic generalizations in low-resource languages, which are responsible for much of the diversity of syntactic patterns worldwide. In this study, we develop targeted syntactic evaluation tests for three low-resource languages (Basque, Hindi, and Swahili) and use them to evaluate five families of open-access multilingual Transformer LMs. We find that some syntactic tasks prove relatively easy for LMs while others (agreement in sentences containing indirect objects in Basque, agreement across a prepositional phrase in Swahili) are challenging. We additionally uncover issues with publicly available Transformers, including a bias toward the habitual aspect in Hindi in multilingual BERT and underperformance compared to similar-sized models in XGLM-4.5B.
Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example, incomplete, erroneous, or inappropriate training data can lead to unreliable models that ultimately produce poor decisions. Trustworthy AI applications require high-quality training and test data along many quality dimensions, such as accuracy, completeness, and consistency. We explore empirically the relationship between six data quality dimensions and the performance of 19 popular machine learning algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining their performance in terms of data quality. Our experiments distinguish three scenarios based on the AI pipeline steps that were fed with polluted data: polluted training data, test data, or both. We conclude the paper with an extensive discussion of our observations.
We introduce a new erasure decoder that applies to arbitrary quantum LDPC codes. Dubbed the cluster decoder, it generalizes the decomposition idea of Vertical-Horizontal (VH) decoding introduced by Connelly et al. in 2022. Like the VH decoder, the idea is to first run the peeling decoder and then post-process the resulting stopping set. The cluster decoder breaks the stopping set into a tree of clusters which can be solved sequentially via Gaussian Elimination (GE). By allowing clusters of unconstrained size, this decoder achieves maximum-likelihood (ML) performance with reduced complexity compared with full GE. When GE is applied only to clusters whose sizes are less than a constant, the performance is degraded but the complexity becomes linear in the block length. Our simulation results show that, for hypergraph product codes, the cluster decoder with constant cluster size achieves near-ML performance similar to VH decoding in the low-erasure-rate regime. For the general quantum LDPC codes we studied, the cluster decoder can be used to estimate the ML performance curve with reduced complexity over a wide range of erasure rates.
The integration of DNA methylation data with a Whole Slide Image (WSI) offers significant potential for enhancing the diagnostic precision of central nervous system (CNS) tumor classification in neuropathology. While existing approaches typically integrate encoded omic data with histology at either an early or late fusion stage, the potential of reintroducing omic data through dual fusion remains unexplored. In this paper, we propose the use of omic embeddings during early and late fusion to capture complementary information from local (patch-level) to global (slide-level) interactions, boosting performance through multimodal integration. In the early fusion stage, omic embeddings are projected onto WSI patches in latent-space, which generates embeddings that encapsulate per-patch molecular and morphological insights. This effectively incorporates omic information into the spatial representation of the WSI. These embeddings are then refined with a Multiple Instance Learning gated attention mechanism which attends to diagnostic patches. In the late fusion stage, we reintroduce the omic data by fusing it with slide-level omic-WSI embeddings using a Multimodal Outer Arithmetic Block (MOAB), which richly intermingles features from both modalities, capturing their correlations and complementarity. We demonstrate accurate CNS tumor subtyping across 20 fine-grained subtypes and validate our approach on benchmark datasets, achieving improved survival prediction on TCGA-BLCA and competitive performance on TCGA-BRCA compared to state-of-the-art methods. This dual fusion strategy enhances interpretability and classification performance, highlighting its potential for clinical diagnostics.
Model checking is a fundamental technique for verifying finite state concurrent systems. Traditionally, model designs were initially created to facilitate the application of model checking. This process, representative of Model Driven Development (MDD), involves generating an equivalent code from a given model which is verified before implementation begins. However, this approach is considerably slower compared to agile development methods and lacks flexibility in terms of expandability and refactoring. We have proposed "CodoMo: Python Code to Model Generator for pyModelChecking." This tool automates the transformation of a Python code by an AST static analyzer and a concolic testing tool into intermediate models suitable for verification with pyModelChecking, bridging the gap between traditional model checking and agile methodologies. Additionally, we have implemented a multiprocess approach that integrates the execution of PyExZ3 with the generation of Kripke structures achieving greater work efficiency. By employing CodoMo, we successfully verified a Tello Drone programming with gesture-based image processing interfaces, showcasing the tool's powerful capability to enhance verification processes while maintaining the agility required for today's fast-paced development cycles.
Personalization stands as the cornerstone of recommender systems (RecSys), striving to sift out redundant information and offer tailor-made services for users. However, the conventional cloud-based RecSys necessitates centralized data collection, posing significant risks of user privacy breaches. In response to this challenge, federated recommender systems (FedRecSys) have emerged, garnering considerable attention. FedRecSys enable users to retain personal data locally and solely share model parameters with low privacy sensitivity for global model training, significantly bolstering the system's privacy protection capabilities. Within the distributed learning framework, the pronounced non-iid nature of user behavior data introduces fresh hurdles to federated optimization. Meanwhile, the ability of federated learning to concurrently learn multiple models presents an opportunity for personalized user modeling. Consequently, the development of personalized FedRecSys (PFedRecSys) is crucial and holds substantial significance. This tutorial seeks to provide an introduction to PFedRecSys, encompassing (1) an overview of existing studies on PFedRecSys, (2) a comprehensive taxonomy of PFedRecSys spanning four pivotal research directions-client-side adaptation, server-side aggregation, communication efficiency, privacy and protection, and (3) exploration of open challenges and promising future directions in PFedRecSys. This tutorial aims to establish a robust foundation and spark new perspectives for subsequent exploration and practical implementations in the evolving realm of RecSys.
Classical optimization theory requires a small step-size for gradient-based methods to converge. Nevertheless, recent findings challenge the traditional idea by empirically demonstrating Gradient Descent (GD) converges even when the step-size $\eta$ exceeds the threshold of $2/L$, where $L$ is the global smooth constant. This is usually known as the Edge of Stability (EoS) phenomenon. A widely held belief suggests that an objective function with subquadratic growth plays an important role in incurring EoS. In this paper, we provide a more comprehensive answer by considering the task of finding linear interpolator $\beta \in R^{d}$ for regression with loss function $l(\cdot)$, where $\beta$ admits parameterization as $\beta = w^2_{+} - w^2_{-}$. Contrary to the previous work that suggests a subquadratic $l$ is necessary for EoS, our novel finding reveals that EoS occurs even when $l$ is quadratic under proper conditions. This argument is made rigorous by both empirical and theoretical evidence, demonstrating the GD trajectory converges to a linear interpolator in a non-asymptotic way. Moreover, the model under quadratic $l$, also known as a depth-$2$ diagonal linear network, remains largely unexplored under the EoS regime. Our analysis then sheds some new light on the implicit bias of diagonal linear networks when a larger step-size is employed, enriching the understanding of EoS on more practical models.
Intent inferral, the process by which a robotic device predicts a user's intent from biosignals, offers an effective and intuitive way to control wearable robots. Classical intent inferral methods treat biosignal inputs as unidirectional ground truths for training machine learning models, where the internal state of the model is not directly observable by the user. In this work, we propose reciprocal learning, a bidirectional paradigm that facilitates human adaptation to an intent inferral classifier. Our paradigm consists of iterative, interwoven stages that alternate between updating machine learning models and guiding human adaptation with the use of augmented visual feedback. We demonstrate this paradigm in the context of controlling a robotic hand orthosis for stroke, where the device predicts open, close, and relax intents from electromyographic (EMG) signals and provides appropriate assistance. We use LED progress-bar displays to communicate to the user the predicted probabilities for open and close intents by the classifier. Our experiments with stroke subjects show reciprocal learning improving performance in a subset of subjects (two out of five) without negatively impacting performance on the others. We hypothesize that, during reciprocal learning, subjects can learn to reproduce more distinguishable muscle activation patterns and generate more separable biosignals.
In pace with developments in the research field of artificial intelligence, knowledge graphs (KGs) have attracted a surge of interest from both academia and industry. As a representation of semantic relations between entities, KGs have proven to be particularly relevant for natural language processing (NLP), experiencing a rapid spread and wide adoption within recent years. Given the increasing amount of research work in this area, several KG-related approaches have been surveyed in the NLP research community. However, a comprehensive study that categorizes established topics and reviews the maturity of individual research streams remains absent to this day. Contributing to closing this gap, we systematically analyzed 507 papers from the literature on KGs in NLP. Our survey encompasses a multifaceted review of tasks, research types, and contributions. As a result, we present a structured overview of the research landscape, provide a taxonomy of tasks, summarize our findings, and highlight directions for future work.