We provide a literature review of Automatic Text Summarization (ATS) systems, following a citation-based approach. We start from popular, well-known papers on each topic we want to cover and track their "backward citations" (papers cited by this initial set) and "forward citations" (newer papers that cite the initial set). To organize the different methods, we present the diverse approaches to ATS guided by the mechanisms they use to generate a summary. Besides presenting the methods, we also provide an extensive review of the datasets available for summarization tasks and of the methods used to evaluate summary quality. Finally, we present an empirical exploration of these methods using the CNN Corpus dataset, which provides golden summaries for extractive and abstractive methods.
Building efficient, accurate, and generalizable reduced order models of developed turbulence remains a major challenge. This manuscript approaches the problem by developing a hierarchy of parameterized reduced Lagrangian models for turbulent flows and investigating the effects of enforcing physical structure through Smoothed Particle Hydrodynamics (SPH) versus relying on neural networks (NNs) as universal function approximators. Starting from NN parameterizations of a Lagrangian acceleration operator, this hierarchy of models gradually incorporates a weakly compressible and parameterized SPH framework, which enforces physical symmetries such as Galilean, rotational, and translational invariance. Within this hierarchy, two new parameterized smoothing kernels are developed to increase the flexibility of the learnable SPH simulators. For each model, we experiment with different loss functions, which are minimized using gradient-based optimization; efficient gradient computations are obtained via Automatic Differentiation (AD) and Sensitivity Analysis (SA). Each model within the hierarchy is trained on two datasets associated with weakly compressible Homogeneous Isotropic Turbulence (HIT): (1) a validation set using weakly compressible SPH; and (2) a high-fidelity set from Direct Numerical Simulations (DNS). Numerical evidence shows that encoding more SPH structure improves generalizability to different turbulent Mach numbers and time shifts, and that including the novel parameterized smoothing kernels improves the accuracy of SPH at the resolved scales.
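To make the idea of a learnable smoothing kernel concrete, here is a minimal sketch under our own assumptions: the Gaussian-type kernel with a tunable decay exponent `theta` is a toy choice, not one of the two kernels developed in the manuscript, and a central finite difference stands in for the AD/SA gradient machinery used in the paper.

```python
import numpy as np

def smoothing_kernel(r, h, theta):
    """Illustrative parameterized kernel: Gaussian-type decay with a learnable
    shape exponent `theta`. Normalization over the support is omitted here,
    although a real SPH kernel must be properly normalized."""
    q = r / h
    return np.exp(-q**theta)

def sph_density(positions, masses, h, theta):
    """Standard SPH density summation rho_i = sum_j m_j W(|x_i - x_j|, h)."""
    diffs = positions[:, None, :] - positions[None, :, :]   # (N, N, d)
    r = np.linalg.norm(diffs, axis=-1)                      # pairwise distances
    W = smoothing_kernel(r, h, theta)
    return (masses[None, :] * W).sum(axis=1)

def loss(theta, positions, masses, h, rho_target):
    """Mean squared error between the SPH density and a reference field."""
    rho = sph_density(positions, masses, h, theta)
    return np.mean((rho - rho_target) ** 2)

# Toy data: random particle cloud and a synthetic "reference" density field.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(64, 3))
m = np.full(64, 1.0 / 64)
rho_ref = sph_density(x, m, h=0.2, theta=2.0)   # stand-in for high-fidelity data

# Central finite difference as a stand-in for AD/SA gradients.
eps, theta0 = 1e-4, 1.5
grad = (loss(theta0 + eps, x, m, 0.2, rho_ref) -
        loss(theta0 - eps, x, m, 0.2, rho_ref)) / (2 * eps)
theta0 -= 1e-1 * grad    # one gradient-descent step on the kernel parameter
```

In the actual hierarchy, such kernel parameters would be updated jointly with the NN weights by the gradient-based optimizer.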
We combine Kronecker products and quantitative information flow to give a novel formal analysis for the fine-grained verification of utility in complex privacy pipelines. The combination explains a surprising anomaly in the behaviour of the utility of privacy-preserving pipelines: sometimes a reduction in privacy also results in a decrease in utility. We use the standard measure of utility for Bayesian analysis, introduced by Ghosh et al., to produce tractable and rigorous proofs of the fine-grained statistical behaviour leading to the anomaly. More generally, we offer the prospect of formal-analysis tools for utility that complement extant formal analyses of privacy. We demonstrate our results on a number of common privacy-preserving designs.
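For readers unfamiliar with the Ghosh et al. utility measure, the sketch below illustrates it in a deliberately simple setting of our own choosing (a uniform prior over a five-element domain, a truncated geometric mechanism, and absolute-error loss); lower posterior-expected loss means higher utility.

```python
import numpy as np

# Discrete secret domain and a uniform prior over it.
domain = np.arange(5)
prior = np.full(len(domain), 1.0 / len(domain))

def geometric_mechanism(alpha):
    """Channel matrix C[x, y] proportional to alpha^|x - y|, renormalized over
    the output domain (a truncated geometric mechanism; alpha closer to 1
    means more noise, i.e. more privacy)."""
    C = alpha ** np.abs(domain[:, None] - domain[None, :])
    return C / C.sum(axis=1, keepdims=True)

def expected_loss(alpha):
    """Bayes-optimal expected absolute-error loss (smaller = higher utility)."""
    C = geometric_mechanism(alpha)
    joint = prior[:, None] * C                 # p(x, y)
    p_y = joint.sum(axis=0)                    # marginal over observations
    total = 0.0
    for y in domain:
        posterior = joint[:, y] / p_y[y]
        # Bayes-optimal guess under absolute error: minimize expected |x - g|.
        losses = [(np.abs(domain - g) * posterior).sum() for g in domain]
        total += p_y[y] * min(losses)
    return total

for alpha in (0.3, 0.6, 0.9):                  # less to more private
    print(f"alpha={alpha:.1f}  expected loss={expected_loss(alpha):.3f}")
```

Fine-grained analyses of pipelines built from several such components are where the Kronecker-product structure of the paper comes into play.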
The present work proposes a framework for nonlinear model order reduction based on a Graph Convolutional Autoencoder (GCA-ROM). In the reduced order modeling (ROM) context, one is interested in obtaining real-time and many-query evaluations of parametric Partial Differential Equations (PDEs). Linear techniques such as Proper Orthogonal Decomposition (POD) and Greedy algorithms have been analyzed thoroughly, but they are more suitable for linear and affine models that show a fast decay of the Kolmogorov n-width. On the one hand, the autoencoder architecture represents a nonlinear generalization of the POD compression procedure, allowing one to encode the main information in a latent set of variables while extracting their main features. On the other hand, Graph Neural Networks (GNNs) constitute a natural framework for studying PDE solutions defined on unstructured meshes. Here, we develop a non-intrusive and data-driven nonlinear reduction approach, exploiting GNNs to encode the reduced manifold and enable fast evaluations of parametrized PDEs. We show the capabilities of the methodology on several models: linear/nonlinear and scalar/vector problems with fast/slow decay in the physically and geometrically parametrized setting. The main properties of our approach are (i) high generalizability in the low-data regime, even for complex problems, (ii) physical compliance with general unstructured grids, and (iii) exploitation of pooling and un-pooling operations to learn from scattered data.
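As a toy illustration of the encode/decode idea only (this is not the GCA-ROM architecture; the weights below are random and untrained), a single graph-convolution layer followed by node pooling compresses one nodal snapshot into a small latent code, and a dense decoder maps it back to the mesh:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, latent_dim = 50, 4

# Toy unstructured-mesh connectivity: a random sparse symmetric adjacency.
A = (rng.random((n_nodes, n_nodes)) < 0.1).astype(float)
A = np.maximum(A, A.T)
A_hat = A + np.eye(n_nodes)                      # add self-loops
d = A_hat.sum(axis=1)
A_hat = A_hat / np.sqrt(d[:, None] * d[None, :]) # symmetric normalization

# One nodal snapshot of a parametric PDE solution (synthetic data here).
u = rng.standard_normal((n_nodes, 1))

# Randomly initialized weights stand in for trained parameters.
W_gc  = rng.standard_normal((1, 8))              # graph-convolution weights
W_enc = rng.standard_normal((8, latent_dim))     # pooled features -> latent code
W_dec = rng.standard_normal((latent_dim, n_nodes))

def encode(u):
    h = np.tanh(A_hat @ u @ W_gc)                # message passing over the mesh
    return np.tanh(h.mean(axis=0) @ W_enc)       # pool nodes into a latent code

def decode(z):
    return z @ W_dec                             # latent code -> nodal field

z = encode(u)
u_rec = decode(z).reshape(n_nodes, 1)
print("latent code:", z, " reconstruction error:", np.linalg.norm(u - u_rec))
```

Training such an encoder/decoder pair on parametric snapshots, with suitable pooling/un-pooling layers, is what turns this toy forward pass into a usable nonlinear ROM.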
This work presents a novel global digital image correlation (DIC) method based on a newly developed convolution finite element (C-FE) approximation. The convolution approximation can rely on the mesh of linear finite elements and enables arbitrarily high-order approximations without adding degrees of freedom. Therefore, the C-FE-based DIC can be more accurate than the usual FE-based DIC, providing highly smooth and accurate displacement and strain results with the same element size. The detailed formulation and implementation of the method are discussed in this work. The controlling parameters of the method include the polynomial order, patch size, and dilation. A general choice of the parameters and their potential adaptivity are discussed. The proposed DIC method has been tested on several representative examples, including the DIC Challenge 2.0 benchmark problems, with comparison to the usual FE-based DIC. C-FE outperformed FE in all the DIC results for the tested examples. This work demonstrates the potential of C-FE and opens a new avenue toward highly smooth, accurate, and robust DIC analysis for full-field displacement and strain measurements.
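The C-FE formulation itself is given in the paper; purely to illustrate the goal of obtaining smoother, higher-order fields from the same linear-mesh degrees of freedom, the sketch below uses a local patch-wise polynomial fit (a moving-least-squares-style stand-in, not the convolution approximation) and compares it with plain linear interpolation:

```python
import numpy as np

def patch_reconstruction(x_nodes, u_nodes, x_eval, order=3, patch=5):
    """For each evaluation point, fit a polynomial of degree `order` to the
    nearest `patch` nodal values and evaluate it there, yielding a smoother,
    higher-order field from the same linear-mesh DOFs."""
    u_eval = np.empty_like(x_eval)
    for i, xq in enumerate(x_eval):
        idx = np.argsort(np.abs(x_nodes - xq))[:patch]   # nodes in the patch
        coeffs = np.polyfit(x_nodes[idx], u_nodes[idx], order)
        u_eval[i] = np.polyval(coeffs, xq)
    return u_eval

# Linear-mesh nodal "displacements" sampled from a smooth field.
x_nodes = np.linspace(0.0, 1.0, 21)
u_nodes = np.sin(2 * np.pi * x_nodes)

x_fine = np.linspace(0.0, 1.0, 200)
u_linear = np.interp(x_fine, x_nodes, u_nodes)               # usual linear FE
u_smooth = patch_reconstruction(x_nodes, u_nodes, x_fine)    # patch-based fit
print("max |error| linear :", np.abs(u_linear - np.sin(2*np.pi*x_fine)).max())
print("max |error| patched:", np.abs(u_smooth - np.sin(2*np.pi*x_fine)).max())
```

The polynomial order and patch size play roles analogous to the controlling parameters listed above, with the paper's dilation parameter having no counterpart in this simplified stand-in.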
The investigation of mixture models is key to understanding and visualizing the distribution of multivariate data. Most mixture model approaches are based on likelihoods and are not adapted to distributions with finite support or without a well-defined density function. This study proposes the Augmented Quantization method, a reformulation of the classical quantization problem that uses the p-Wasserstein distance. This metric can be computed in very general distribution spaces, in particular for distributions with varying supports. The clustering interpretation of quantization is revisited in a more general framework. The performance of Augmented Quantization is first demonstrated on analytical toy problems. Subsequently, it is applied to a practical case study involving river flooding, in which mixtures of Dirac and Uniform distributions are built in the input space, enabling the identification of the most influential variables.
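In one dimension, the p-Wasserstein distance between empirical distributions reduces to a p-norm of differences between sorted samples, which is all the sketch below uses (a generic illustration, not the authors' implementation); note that the distance is well defined even for a Dirac mass, which has no density:

```python
import numpy as np

def wasserstein_p_1d(x, y, p=2):
    """p-Wasserstein distance between two 1D empirical distributions with
    equally many samples: sort both and take the L^p mean of differences."""
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    assert len(x) == len(y), "this shortcut assumes equal sample sizes"
    return (np.mean(np.abs(x - y) ** p)) ** (1.0 / p)

rng = np.random.default_rng(0)
uniform_sample = rng.uniform(0.0, 1.0, size=1000)
dirac_sample = np.full(1000, 0.5)    # Dirac mass at 0.5: no density, but the
                                     # Wasserstein distance to it still exists
print(wasserstein_p_1d(uniform_sample, dirac_sample, p=2))
```

This ability to compare distributions with mismatched or degenerate supports is precisely what makes the Wasserstein-based reformulation attractive for Dirac/Uniform mixtures.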
Motivation: Curation of the literature in the life sciences is a growing challenge. The continued increase in the rate of publication, coupled with the relatively fixed number of curators worldwide, presents a major challenge to developers of biomedical knowledgebases. Very few knowledgebases have the resources to scale to the whole relevant literature, and all have to prioritise their efforts. Results: In this work, we take a first step towards alleviating the lack of curator time in RNA science by generating summaries of the literature for non-coding RNAs using large language models (LLMs). We demonstrate that high-quality, factually accurate summaries with accurate references can be generated automatically from the literature using a commercial LLM and a chain of prompts and checks. Manual assessment was carried out for a subset of summaries, with the majority rated as extremely high quality. We also applied the most commonly used automated evaluation approaches and found that they do not correlate with human assessment. Finally, we apply our tool to a selection of over 4,600 ncRNAs and make the generated summaries available via the RNAcentral resource. We conclude that automated literature summarization is feasible with the current generation of LLMs, provided careful prompting and automated checking are applied. Availability: Code used to produce these summaries can be found here: //github.com/RNAcentral/litscan-summarization, and the dataset of contexts and summaries can be found here: //huggingface.co/datasets/RNAcentral/litsumm-v1. Summaries are also displayed on the RNA report pages in RNAcentral (//rnacentral.org/).
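Schematically, the pipeline is a generate-check-revise loop. The sketch below is our own reconstruction under stated assumptions: `call_llm`, `references_are_valid`, and the prompt wording are hypothetical placeholders, not the code in the linked litscan-summarization repository.

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a commercial LLM API; in practice this
    would call the provider's client library."""
    return "Placeholder summary citing PMC0000001."

def references_are_valid(summary: str, allowed_pmcids: set[str]) -> bool:
    """Automated check: every citation in the summary must point to one of
    the articles actually supplied as context."""
    cited = set(re.findall(r"PMC\d+", summary))
    return len(cited) > 0 and cited <= allowed_pmcids

def summarise_ncrna(rna_id: str, contexts: list[str], allowed_pmcids: set[str],
                    max_rounds: int = 3) -> str:
    """Generate-check-revise loop: draft a summary from retrieved sentences,
    then ask the model to revise while any automated check fails."""
    prompt = (f"Summarise what is known about {rna_id} using only the text "
              f"below, citing sources by their PMC ids.\n\n"
              + "\n\n".join(contexts))
    summary = call_llm(prompt)
    for _ in range(max_rounds):
        if references_are_valid(summary, allowed_pmcids):
            break
        summary = call_llm("Revise this summary so that every statement is "
                           "supported by, and cited to, the provided text:\n"
                           + summary)
    return summary

# Example usage with toy contexts (the PMC id is a placeholder):
# summarise_ncrna("HOTAIR", ["... sentence from PMC123456 ..."], {"PMC123456"})
```

The key design point is that the checks are mechanical and reference-aware, so failures feed back into the next prompt rather than silently reaching curators.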
We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in the training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation. Machine learning models trained on such data will learn nothing but to exactly reconstruct the known target definition. Such models show perfect performance on similarly constructed test data but fail catastrophically on real-world examples in which the defining fundamental measurements are missing or only incompletely available. We present a general procedure that allows the identification of problematic datasets and of black-box machine learning models trained on them, and we exemplify our detection procedure on the task of early prediction of sepsis.
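A toy version of the problem (our own construction, not the paper's sepsis data): the label is a threshold of one input feature, so a standard classifier reaches near-perfect test accuracy on similarly constructed data but drops to roughly chance level once that defining feature is unavailable at deployment time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# "Fundamental measurements": one of them (score) also defines the label.
score = rng.normal(size=n)            # e.g. a clinical score
other = rng.normal(size=(n, 3))       # unrelated measurements
X = np.column_stack([score, other])
y = (score > 0.0).astype(int)         # label = threshold of the score

X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy with leaked feature:", clf.score(X_test, y_test))

# Deployment scenario: the defining measurement is unavailable (imputed to 0),
# and accuracy collapses to roughly chance level.
X_deploy = X_test.copy()
X_deploy[:, 0] = 0.0
print("accuracy without the defining feature:", clf.score(X_deploy, y_test))
```

The paper's detection procedure targets exactly this signature: performance that depends entirely on features encoding the label definition itself.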
Synthetic biologists and molecular programmers design novel nucleic acid reactions with many potential applications. Good visualization tools are needed to help domain experts make sense of the complex outputs of folding pathway simulations of such reactions. Here we present ViDa, a new approach for visualizing DNA reaction folding trajectories over the energy landscape of secondary structures. We integrate a deep graph embedding model with common dimensionality reduction approaches to map high-dimensional data onto 2D Euclidean space. We assess ViDa on two well-studied and contrasting DNA hybridization reactions. Our preliminary results suggest that ViDa's visualization successfully separates trajectories with different folding mechanisms, thereby providing useful insight to users, and is a significant improvement over the current state of the art in DNA kinetics visualization.
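As a rough sketch of the final visualization stage only (ViDa's learned graph embedding is replaced here by synthetic vectors, and PCA stands in for the dimensionality-reduction step), high-dimensional structure embeddings from two trajectories are projected to 2D coordinates ready for plotting:

```python
import numpy as np

def pca_2d(embeddings):
    """Project high-dimensional structure embeddings onto their first two
    principal components (a simple stand-in for the dimensionality-reduction
    stage that follows the learned graph embedding)."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy stand-ins for learned embeddings of secondary structures visited by
# two folding trajectories of a hybridization reaction.
rng = np.random.default_rng(0)
traj_a = rng.normal(0.0, 1.0, size=(200, 64))
traj_b = rng.normal(3.0, 1.0, size=(200, 64))

coords = pca_2d(np.vstack([traj_a, traj_b]))
print(coords.shape)   # (400, 2): ready to scatter-plot, coloured by trajectory
```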
We collect robust proposals given in the field of regression models with heteroscedastic errors. Our motivation stems from the fact that the practitioner frequently faces the confluence of two phenomena in the context of data analysis: non-linearity and heteroscedasticity. The impact of heteroscedasticity on the precision of the estimators is well known; however, the conjunction of these two phenomena makes handling outliers more difficult. An iterative procedure to estimate the parameters of a heteroscedastic non-linear model is considered. The studied estimators combine weighted $MM$-regression estimators, to control the impact of high-leverage points, with a robust method to estimate the parameters of the variance function.
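To make the iterative scheme concrete, here is a deliberately simplified sketch under our own assumptions: a Huber-type M-step (standing in for the weighted MM-regression estimators of the paper) fits the nonlinear mean, a robust regression on log absolute residuals fits the variance function, and the two steps alternate.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 200)
beta_true, gamma_true = (2.0, 0.8), (-2.0, 1.0)

def mean_fn(beta, x):        # nonlinear regression function
    return beta[0] * np.exp(beta[1] * x)

def sigma_fn(gamma, x):      # variance (scale) function, positive by construction
    return np.exp(gamma[0] + gamma[1] * x)

y = mean_fn(beta_true, x) + sigma_fn(gamma_true, x) * rng.standard_normal(x.size)
y[::25] += 10.0              # inject a few gross outliers

beta, gamma = np.array([1.0, 1.0]), np.array([0.0, 0.0])
for _ in range(5):           # alternate mean fit and variance-function fit
    # Robust, variance-weighted fit of the mean (Huber loss as an M-type step).
    w = 1.0 / sigma_fn(gamma, x)
    beta = least_squares(lambda b: w * (y - mean_fn(b, x)),
                         beta, loss="huber", f_scale=1.345).x
    # Robust fit of the log-scale function from log absolute residuals
    # (gamma[0] absorbs a constant bias from E[log|eps|]; fine for a sketch).
    r = y - mean_fn(beta, x)
    gamma = least_squares(lambda g: np.log(np.abs(r) + 1e-8) - (g[0] + g[1] * x),
                          gamma, loss="huber", f_scale=1.345).x

print("estimated mean parameters:", beta)
```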
This paper introduces an extended tensor decomposition (XTD) method for model reduction. The proposed method is based on a sparse, non-separated enrichment of the conventional tensor decomposition, which is expected to improve the approximation accuracy and the reducibility (compressibility) in highly nonlinear and singular cases. The proposed XTD method can be a powerful tool for solving nonlinear space-time parametric problems. The method has been successfully applied to parametric elastic-plastic problems and to real-time additive manufacturing residual stress predictions with uncertainty quantification. Furthermore, a combined XTD-SCA (self-consistent clustering analysis) strategy is presented for multi-scale material modeling, enabling real-time multi-scale, multi-parametric simulations. The efficiency of the method is demonstrated through comparison with finite element analysis. The proposed method enables a novel framework for fast manufacturing and material design under uncertainty.
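As a rough low-dimensional analogue of the idea (not the XTD algorithm itself), the sketch below enriches a conventional separated (low-rank) approximation of a matrix snapshot with a sparse, non-separated correction, alternating a truncated SVD with hard thresholding of the residual.

```python
import numpy as np

def separated_plus_sparse(X, rank=3, sparsity=0.02, n_iter=20):
    """Alternating approximation X ~ L + S, where L is a rank-`rank`
    separated representation (sum of outer products) and S is a sparse,
    non-separated enrichment capturing localized/singular features."""
    S = np.zeros_like(X)
    k = int(sparsity * X.size)                    # number of sparse entries kept
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # truncated separated part
        R = X - L
        thresh = np.partition(np.abs(R).ravel(), -k)[-k]
        S = np.where(np.abs(R) >= thresh, R, 0.0) # keep largest residual entries
    return L, S

# Synthetic snapshot: smooth separable field plus a sharp localized feature.
x = np.linspace(0, 1, 100)
X = np.outer(np.sin(np.pi * x), np.cos(np.pi * x))
X[45:55, 45:55] += 1.0                            # localized "singular" region
L, S = separated_plus_sparse(X)
print("relative error:", np.linalg.norm(X - L - S) / np.linalg.norm(X))
```

The sparse term captures the localized feature that a purely separated representation would need many modes to resolve, which is the intuition behind the enrichment in singular and highly nonlinear cases.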