In recent years, the scientific community has become increasingly interested on peptides with non-canonical amino acids due to their superior stability and resistance to proteolytic degradation. These peptides present promising modifications to biological, pharmacological, and physiochemical attributes in both endogenous and engineered peptides. Notwithstanding their considerable advantages, the scientific community exhibits a conspicuous absence of an effective pre-trained model adept at distilling feature representations from such complex peptide sequences. We herein propose PepLand, a novel pre-training architecture for representation and property analysis of peptides spanning both canonical and non-canonical amino acids. In essence, PepLand leverages a comprehensive multi-view heterogeneous graph neural network tailored to unveil the subtle structural representations of peptides. Empirical validations underscore PepLand's effectiveness across an array of peptide property predictions, encompassing protein-protein interactions, permeability, solubility, and synthesizability. The rigorous evaluation confirms PepLand's unparalleled capability in capturing salient synthetic peptide features, thereby laying a robust foundation for transformative advances in peptide-centric research domains. We have made all the source code utilized in this study publicly accessible via GitHub at //github.com/zhangruochi/pepland
Twitter as one of the most popular social networks, offers a means for communication and online discourse, which unfortunately has been the target of bots and fake accounts, leading to the manipulation and spreading of false information. Towards this end, we gather a challenging, multilingual dataset of social discourse on Twitter, originating from 9M users regarding the recent Russo-Ukrainian war, in order to detect the bot accounts and the conversation involving them. We collect the ground truth for our dataset through the Twitter API suspended accounts collection, containing approximately 343K of bot accounts and 8M of normal users. Additionally, we use a dataset provided by Botometer-V3 with 1,777 Varol, 483 German accounts, and 1,321 US accounts. Besides the publicly available datasets, we also manage to collect 2 independent datasets around popular discussion topics of the 2022 energy crisis and the 2022 conspiracy discussions. Both of the datasets were labeled according to the Twitter suspension mechanism. We build a novel ML model for bot detection using the state-of-the-art XGBoost model. We combine the model with a high volume of labeled tweets according to the Twitter suspension mechanism ground truth. This requires a limited set of profile features allowing labeling of the dataset in different time periods from the collection, as it is independent of the Twitter API. In comparison with Botometer our methodology achieves an average 11% higher ROC-AUC score over two real-case scenario datasets.
Bayesian model-averaged hypothesis testing is an important technique in regression because it addresses the problem that the evidence one variable directly affects an outcome often depends on which other variables are included in the model. This problem is caused by confounding and mediation, and is pervasive in big data settings with thousands of variables. However, model-averaging is under-utilized in fields, like epidemiology, where classical statistical approaches dominate. Here we show that simultaneous Bayesian and frequentist model-averaged hypothesis testing is possible in large samples, for a family of priors. We show that Bayesian model-averaged regression is a closed testing procedure, and use the theory of regular variation to derive interchangeable posterior odds and $p$-values that jointly control the Bayesian false discovery rate (FDR), the frequentist type I error rate, and the frequentist familywise error rate (FWER). These results arise from an asymptotic chi-squared distribution for the model-averaged deviance, under the null hypothesis. We call the approach 'Doublethink'. In a related manuscript (Arning, Fryer and Wilson, 2024), we apply it to discovering direct risk factors for COVID-19 hospitalization in UK Biobank, and we discuss its broader implications for bridging the differences between Bayesian and frequentist hypothesis testing.
This work focuses on the task of property targeting: that is, generating molecules conditioned on target chemical properties to expedite candidate screening for novel drug and materials development. DiGress is a recent diffusion model for molecular graphs whose distinctive feature is allowing property targeting through classifier-based (CB) guidance. While CB guidance may work to generate molecular-like graphs, we hint at the fact that its assumptions apply poorly to the chemical domain. Based on this insight we propose a classifier-free DiGress (FreeGress), which works by directly injecting the conditioning information into the training process. CF guidance is convenient given its less stringent assumptions and since it does not require to train an auxiliary property regressor, thus halving the number of trainable parameters in the model. We empirically show that our model yields up to 79% improvement in Mean Absolute Error with respect to DiGress on property targeting tasks on QM9 and ZINC-250k benchmarks. As an additional contribution, we propose a simple yet powerful approach to improve chemical validity of generated samples, based on the observation that certain chemical properties such as molecular weight correlate with the number of atoms in molecules.
We propose to enhance the training of physics-informed neural networks (PINNs). To this aim, we introduce nonlinear additive and multiplicative preconditioning strategies for the widely used L-BFGS optimizer. The nonlinear preconditioners are constructed by utilizing the Schwarz domain-decomposition framework, where the parameters of the network are decomposed in a layer-wise manner. Through a series of numerical experiments, we demonstrate that both, additive and multiplicative preconditioners significantly improve the convergence of the standard L-BFGS optimizer, while providing more accurate solutions of the underlying partial differential equations. Moreover, the additive preconditioner is inherently parallel, thus giving rise to a novel approach to model parallelism.
Despite the breakthroughs in biomarker discovery facilitated by differential gene analysis, challenges remain, particularly at the single-cell level. Traditional methodologies heavily rely on user-supplied cell annotations, focusing on individually expressed data, often neglecting the critical interactions between biological conditions, such as healthy versus diseased states. In response, here we introduce scBeacon, an innovative framework built upon a deep contrastive siamese network. scBeacon pioneers an unsupervised approach, adeptly identifying matched cell populations across varied conditions, enabling a refined differential gene analysis. By utilizing a VQ-VAE framework, a contrastive siamese network, and a greedy iterative strategy, scBeacon effectively pinpoints differential genes that hold potential as key biomarkers. Comprehensive evaluations on a diverse array of datasets validate scBeacon's superiority over existing single-cell differential gene analysis tools. Its precision and adaptability underscore its significant role in enhancing diagnostic accuracy in biomarker discovery. With the emphasis on the importance of biomarkers in diagnosis, scBeacon is positioned to be a pivotal asset in the evolution of personalized medicine and targeted treatments.
We develop new tools to study landscapes in nonconvex optimization. Given one optimization problem, we pair it with another by smoothly parametrizing the domain. This is either for practical purposes (e.g., to use smooth optimization algorithms with good guarantees) or for theoretical purposes (e.g., to reveal that the landscape satisfies a strict saddle property). In both cases, the central question is: how do the landscapes of the two problems relate? More precisely: how do desirable points such as local minima and critical points in one problem relate to those in the other problem? A key finding in this paper is that these relations are often determined by the parametrization itself, and are almost entirely independent of the cost function. Accordingly, we introduce a general framework to study parametrizations by their effect on landscapes. The framework enables us to obtain new guarantees for an array of problems, some of which were previously treated on a case-by-case basis in the literature. Applications include: optimizing low-rank matrices and tensors through factorizations; solving semidefinite programs via the Burer-Monteiro approach; training neural networks by optimizing their weights and biases; and quotienting out symmetries.
In this note, we present an abstract approach to study asymptotic orders for adaptive approximations with respect to a monotone set function $\mathfrak{J}$ defined on dyadic cubes. We determine the exact upper order in terms of the critical value of the corresponding $\mathfrak{J}$-partition function, and we are able to provide upper and lower bounds in term of fractal-geometric quantities. With properly chosen $\mathfrak{J}$, our new approach has applications in many different areas of mathematics, including the spectral theory of Krein-Feller operators, quantization dimensions of compactly supported probability measures, and the exact asymptotic order for Kolmogorov, Gelfand and linear widths for Sobolev embeddings into $L_{\mu}^p$-spaces.
In recent literature, for modeling reasons, fractional differential problems have been considered equipped with anti-symmetric boundary conditions. Twenty years ago the anti-reflective boundary conditions were introduced in a context of signal processing and imaging for increasing the quality of the reconstruction of a blurred signal/image contaminated by noise and for reducing the overall complexity to that of few fast sine transforms i.e. to $O(N\log N)$ real arithmetic operations, where $N$ is the number of pixels. Here we consider the anti-symmetric boundary conditions and we introduce the anti-reflective boundary conditions in the context of nonlocal problems of fractional differential type. In the latter context, we study both types of boundary conditions, which in reality are similar in the essentials, from the perspective of computational efficiency, by considering nontruncated and truncated versions. Several numerical tests, tables, and visualizations are provided and critically discussed.
As advancements in artificial intelligence (AI) propel progress in the life sciences, they may also enable the weaponisation and misuse of biological agents. This article differentiates two classes of AI tools that could pose such biosecurity risks: large language models (LLMs) and biological design tools (BDTs). LLMs, such as GPT-4 and its successors, might provide dual-use information and thus remove some barriers encountered by historical biological weapons efforts. As LLMs are turned into multi-modal lab assistants and autonomous science tools, this will increase their ability to support non-experts in performing laboratory work. Thus, LLMs may in particular lower barriers to biological misuse. In contrast, BDTs will expand the capabilities of sophisticated actors. Concretely, BDTs may enable the creation of pandemic pathogens substantially worse than anything seen to date and could enable forms of more predictable and targeted biological weapons. In combination, the convergence of LLMs and BDTs could raise the ceiling of harm from biological agents and could make them broadly accessible. A range of interventions would help to manage risks. Independent pre-release evaluations could help understand the capabilities of models and the effectiveness of safeguards. Options for differentiated access to such tools should be carefully weighed with the benefits of openly releasing systems. Lastly, essential for mitigating risks will be universal and enhanced screening of gene synthesis products.
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.