A key problem toward the use of microorganisms as bio-factories is reaching and maintaining cellular communities at a desired density and composition so that they can efficiently convert their biomass into useful compounds. Promising technological platforms for the real time, scalable control of cellular density are bioreactors. In this work, we developed a learning-based strategy to expand the toolbox of available control algorithms capable of regulating the density of a \textit{single} bacterial population in bioreactors. Specifically, we used a sim-to-real paradigm, where a simple mathematical model, calibrated using a few data, was adopted to generate synthetic data for the training of the controller. The resulting policy was then exhaustively tested in vivo using a low-cost bioreactor known as Chi.Bio, assessing performance and robustness. In addition, we compared the performance with more traditional controllers (namely, a PI and an MPC), confirming that the learning-based controller exhibits similar performance in vivo. Our work showcases the viability of learning-based strategies for the control of cellular density in bioreactors, making a step forward toward their use for the control of the composition of microbial consortia.
Spatially distributed functional data are prevalent in many statistical applications such as meteorology, energy forecasting, census data, disease mapping, and neurological studies. Given their complex and high-dimensional nature, functional data often require dimension reduction methods to extract meaningful information. Inverse regression is one such approach that has become very popular in the past two decades. We study the inverse regression in the framework of functional data observed at irregularly positioned spatial sites. The functional predictor is the sum of a spatially dependent functional effect and a spatially independent functional nugget effect, while the relation between the scalar response and the functional predictor is modeled using the inverse regression framework. For estimation, we consider local linear smoothing with a general weighting scheme, which includes as special cases the schemes under which equal weights are assigned to each observation or to each subject. This framework enables us to present the asymptotic results for different types of sampling plans over time such as non-dense, dense, and ultra-dense. We discuss the domain-expanding infill (DEI) framework for spatial asymptotics, which is a mix of the traditional expanding domain and infill frameworks. The DEI framework overcomes the limitations of traditional spatial asymptotics in the existing literature. Under this unified framework, we develop asymptotic theory and identify conditions that are necessary for the estimated eigen-directions to achieve optimal rates of convergence. Our asymptotic results include pointwise and $L_2$ convergence rates. Simulation studies using synthetic data and an application to a real-world dataset confirm the effectiveness of our methods.
One of the challenges of natural language understanding is to deal with the subjectivity of sentences, which may express opinions and emotions that add layers of complexity and nuance. Sentiment analysis is a field that aims to extract and analyze these subjective elements from text, and it can be applied at different levels of granularity, such as document, paragraph, sentence, or aspect. Aspect-based sentiment analysis is a well-studied topic with many available data sets and models. However, there is no clear definition of what makes a sentence difficult for aspect-based sentiment analysis. In this paper, we explore this question by conducting an experiment with three data sets: "Laptops", "Restaurants", and "MTSC" (Multi-Target-dependent Sentiment Classification), and a merged version of these three datasets. We study the impact of domain diversity and syntactic diversity on difficulty. We use a combination of classifiers to identify the most difficult sentences and analyze their characteristics. We employ two ways of defining sentence difficulty. The first one is binary and labels a sentence as difficult if the classifiers fail to correctly predict the sentiment polarity. The second one is a six-level scale based on how many of the top five best-performing classifiers can correctly predict the sentiment polarity. We also define 9 linguistic features that, combined, aim at estimating the difficulty at sentence level.
In order to minimize the generalization error in neural networks, a novel technique to identify overfitting phenomena when training the learner is formally introduced. This enables support of a reliable and trustworthy early stopping condition, thus improving the predictive power of that type of modeling. Our proposal exploits the correlation over time in a collection of online indicators, namely characteristic functions for indicating if a set of hypotheses are met, associated with a range of independent stopping conditions built from a canary judgment to evaluate the presence of overfitting. That way, we provide a formal basis for decision making in terms of interrupting the learning process. As opposed to previous approaches focused on a single criterion, we take advantage of subsidiarities between independent assessments, thus seeking both a wider operating range and greater diagnostic reliability. With a view to illustrating the effectiveness of the halting condition described, we choose to work in the sphere of natural language processing, an operational continuum increasingly based on machine learning. As a case study, we focus on parser generation, one of the most demanding and complex tasks in the domain. The selection of cross-validation as a canary function enables an actual comparison with the most representative early stopping conditions based on overfitting identification, pointing to a promising start toward an optimal bias and variance control.
A Gaussian process is proposed as a model for the posterior distribution of the local predictive ability of a model or expert, conditional on a vec- tor of covariates, from historical predictions in the form of log predictive scores. Assuming Gaussian expert predictions and a Gaussian data generat- ing process, a linear transformation of the predictive score follows a noncen- tral chi-squared distribution with one degree of freedom. Motivated by this we develop a non-central chi-squared Gaussian process regression to flexibly model local predictive ability, with the posterior distribution of the latent GP function and kernel hyperparameters sampled by Hamiltonian Monte Carlo. We show that a cube-root transformation of the log scores is approximately Gaussian with homoscedastic variance, which makes it possible to estimate the model much faster by marginalizing the latent GP function analytically. Linear pools based on learned local predictive ability are applied to predict daily bike usage in Washington DC.
This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio respectively, which were then used to compute high-level coordination features that served as the inputs to the audio and video modalities. Context-independent text embeddings extracted from transcriptions of speech were used as the input for the text modality. The multi-modal system is developed by fusing a segment-to-session-level classifier for video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in the weighted average F1 score.
Revealing hidden dynamics from the stochastic data is a challenging problem as randomness takes part in the evolution of the data. The problem becomes exceedingly complex when the trajectories of the stochastic data are absent in many scenarios. Here we present an approach to effectively modeling the dynamics of the stochastic data without trajectories based on the weak form of the Fokker-Planck (FP) equation, which governs the evolution of the density function in the Brownian process. Taking the collocations of Gaussian functions as the test functions in the weak form of the FP equation, we transfer the derivatives to the Gaussian functions and thus approximate the weak form by the expectational sum of the data. With a dictionary representation of the unknown terms, a linear system is built and then solved by the regression, revealing the unknown dynamics of the data. Hence, we name the method with the Weak Collocation Regression (WCR) method for its three key components: weak form, collocation of Gaussian kernels, and regression. The numerical experiments show that our method is flexible and fast, which reveals the dynamics within seconds in multi-dimensional problems and can be easily extended to high-dimensional data such as 20 dimensions. WCR can also correctly identify the hidden dynamics of the complex tasks with variable-dependent diffusion and coupled drift, and the performance is robust, achieving high accuracy in the case with noise added.
AI-enabled synthetic biology has tremendous potential but also significantly increases biorisks and brings about a new set of dual use concerns. The picture is complicated given the vast innovations envisioned to emerge by combining emerging technologies, as AI-enabled synthetic biology potentially scales up bioengineering into industrial biomanufacturing. However, the literature review indicates that goals such as maintaining a reasonable scope for innovation, or more ambitiously to foster a huge bioeconomy don't necessarily contrast with biosafety, but need to go hand in hand. This paper presents a literature review of the issues and describes emerging frameworks for policy and practice that transverse the options of command-and control, stewardship, bottom-up, and laissez-faire governance. How to achieve early warning systems that enable prevention and mitigation of future AI-enabled biohazards from the lab, from deliberate misuse, or from the public realm, will constantly need to evolve, and adaptive, interactive approaches should emerge. Although biorisk is subject to an established governance regime, and scientists generally adhere to biosafety protocols, even experimental, but legitimate use by scientists could lead to unexpected developments. Recent advances in chatbots enabled by generative AI have revived fears that advanced biological insight can more easily get into the hands of malignant individuals or organizations. Given these sets of issues, society needs to rethink how AI-enabled synthetic biology should be governed. The suggested way to visualize the challenge at hand is whack-a-mole governance, although the emerging solutions are perhaps not so different either.
Effect modification occurs when the impact of the treatment on an outcome varies based on the levels of other covariates known as effect modifiers. Modeling of these effect differences is important for etiological goals and for purposes of optimizing treatment. Structural nested mean models (SNMMs) are useful causal models for estimating the potentially heterogeneous effect of a time-varying exposure on the mean of an outcome in the presence of time-varying confounding. In longitudinal health studies, information on many demographic, behavioural, biological, and clinical covariates may be available, among which some might cause heterogeneous treatment effects. A data-driven approach for selecting the effect modifiers of an exposure may be necessary if these effect modifiers are \textit{a priori} unknown and need to be identified. Although variable selection techniques are available in the context of estimating conditional average treatment effects using marginal structural models, or in the context of estimating optimal dynamic treatment regimens, all of these methods consider an outcome measured at a single point in time. In the context of an SNMM for repeated outcomes, we propose a doubly robust penalized G-estimator for the causal effect of a time-varying exposure with a simultaneous selection of effect modifiers and prove the oracle property of our estimator. We conduct a simulation study to evaluate the performance of the proposed estimator in finite samples and for verification of its double-robustness property. Our work is motivated by a study of hemodiafiltration for treating patients with end-stage renal disease at the Centre Hospitalier de l'Universit\'e de Montr\'eal.
We hypothesize that due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate the model's dependence on each modality, we compute the gain on the accuracy when the model has access to it in addition to another modality. We refer to this gain as the conditional utilization rate. In the experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.