The innovation of next-generation sequencing (NGS) techniques has significantly reduced the price of genome sequencing, lowering barriers to future medical research; it is now feasible to apply genome sequencing to studies where it would have previously been cost-inefficient. Identifying damaging or pathogenic mutations in vast amounts of complex, high-dimensional genome sequencing data may be of particular interest to researchers. Thus, this paper's aims were to train machine learning models on the attributes of a genetic mutation to predict LoFtool scores (which measure a gene's intolerance to loss-of-function mutations). These attributes included, but were not limited to, the position of a mutation on a chromosome, changes in amino acids, and changes in codons caused by the mutation. Models were built using the univariate feature selection technique f-regression combined with K-nearest neighbors (KNN), Support Vector Machine (SVM), Random Sample Consensus (RANSAC), Decision Trees, Random Forest, and Extreme Gradient Boosting (XGBoost). These models were evaluated using five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance. The findings of this study include the training of multiple models with testing set r-squared values of 0.97.
Generative diffusion models have achieved spectacular performance in many areas of generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, variational inference and stochastic calculus, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We show that these phase-transitions are always in a mean-field universality class, as they are the result of a self-consistency condition in the generative dynamics. We argue that the critical instability that arises from the phase transitions lies at the heart of their generative capabilities, which are characterized by a set of mean field critical exponents. Furthermore, using the statistical physics of disordered systems, we show that memorization can be understood as a form of critical condensation corresponding to a disordered phase transition. Finally, we show that the dynamic equation of the generative process can be interpreted as a stochastic adiabatic transformation that minimizes the free energy while keeping the system in thermal equilibrium.
In many communication contexts, the capabilities of the involved actors cannot be known beforehand, whether it is a cell, a plant, an insect, or even a life form unknown to Earth. Regardless of the recipient, the message space and time scale could be too fast, too slow, too large, or too small and may never be decoded. Therefore, it pays to devise a way to encode messages agnostic of space and time scales. We propose the use of fractal functions as self-executable infinite-frequency carriers for sending messages, given their properties of structural self-similarity and scale invariance. We call it `fractal messaging'. Starting from a spatial embedding, we introduce a framework for a space-time scale-free messaging approach to this challenge. When considering a space and time-agnostic framework for message transmission, it would be interesting to encode a message such that it could be decoded at several spatio-temporal scales. Hence, the core idea of the framework proposed herein is to encode a binary message as waves along infinitely many frequencies (in power-like distributions) and amplitudes, transmit such a message, and then decode and reproduce it. To do so, the components of the Weierstrass function, a known fractal, are used as carriers of the message. Each component will have its amplitude modulated to embed the binary stream, allowing for a space-time-agnostic approach to messaging.
This work presents a comparative review and classification between some well-known thermodynamically consistent models of hydrogel behavior in a large deformation setting, specifically focusing on solvent absorption/desorption and its impact on mechanical deformation and network swelling. The proposed discussion addresses formulation aspects, general mathematical classification of the governing equations, and numerical implementation issues based on the finite element method. The theories are presented in a unified framework demonstrating that, despite not being evident in some cases, all of them follow equivalent thermodynamic arguments. A detailed numerical analysis is carried out where Taylor-Hood elements are employed in the spatial discretization to satisfy the inf-sup condition and to prevent spurious numerical oscillations. The resulting discrete problems are solved using the FEniCS platform through consistent variational formulations, employing both monolithic and staggered approaches. We conduct benchmark tests on various hydrogel structures, demonstrating that major differences arise from the chosen volumetric response of the hydrogel. The significance of this choice is frequently underestimated in the state-of-the-art literature but has been shown to have substantial implications on the resulting hydrogel behavior.
Deep learning has achieved widespread success in medical image analysis, leading to an increasing demand for large-scale expert-annotated medical image datasets. Yet, the high cost of annotating medical images severely hampers the development of deep learning in this field. To reduce annotation costs, active learning aims to select the most informative samples for annotation and train high-performance models with as few labeled samples as possible. In this survey, we review the core methods of active learning, including the evaluation of informativeness and sampling strategy. For the first time, we provide a detailed summary of the integration of active learning with other label-efficient techniques, such as semi-supervised, self-supervised learning, and so on. We also summarize active learning works that are specifically tailored to medical image analysis. Additionally, we conduct a thorough comparative analysis of the performance of different AL methods in medical image analysis with experiments. In the end, we offer our perspectives on the future trends and challenges of active learning and its applications in medical image analysis.
Data augmentation is arguably the most important regularization technique commonly used to improve generalization performance of machine learning models. It primarily involves the application of appropriate data transformation operations to create new data samples with desired properties. Despite its effectiveness, the process is often challenging because of the time-consuming trial and error procedures for creating and testing different candidate augmentations and their hyperparameters manually. Automated data augmentation methods aim to automate the process. State-of-the-art approaches typically rely on automated machine learning (AutoML) principles. This work presents a comprehensive survey of AutoML-based data augmentation techniques. We discuss various approaches for accomplishing data augmentation with AutoML, including data manipulation, data integration and data synthesis techniques. We present extensive discussion of techniques for realizing each of the major subtasks of the data augmentation process: search space design, hyperparameter optimization and model evaluation. Finally, we carried out an extensive comparison and analysis of the performance of automated data augmentation techniques and state-of-the-art methods based on classical augmentation approaches. The results show that AutoML methods for data augmentation currently outperform state-of-the-art techniques based on conventional approaches.
This paper proposes a~simple, yet powerful, method for balancing distributions of covariates for causal inference based on observational studies. The method makes it possible to balance an arbitrary number of quantiles (e.g., medians, quartiles, or deciles) together with means if necessary. The proposed approach is based on the theory of calibration estimators (Deville and S\"arndal 1992), in particular, calibration estimators for quantiles, proposed by Harms and Duchesne (2006). The method does not require numerical integration, kernel density estimation or assumptions about the distributions. Valid estimates can be obtained by drawing on existing asymptotic theory. An~illustrative example of the proposed approach is presented for the entropy balancing method and the covariate balancing propensity score method. Results of a~simulation study indicate that the method efficiently estimates average treatment effects on the treated (ATT), the average treatment effect (ATE), the quantile treatment effect on the treated (QTT) and the quantile treatment effect (QTE), especially in the presence of non-linearity and mis-specification of the models. The proposed approach can be further generalized to other designs (e.g. multi-category, continuous) or methods (e.g. synthetic control method). An open source software implementing proposed methods is available.
In the era of Internet of Things, how to develop a smart sensor system with sustainable power supply, easy deployment and flexible use has become a difficult problem to be solved. The traditional power supply has problems such as frequent replacement or charging when in use, which limits the development of wearable devices. The contact-to-separate friction nanogenerator (TENG) was prepared by using polychotomy thy lene (PTFE) and aluminum (AI) foils. Human motion energy was collected by human body arrangement, and human motion posture was monitored according to the changes of output electrical signals. In 2012, Academician Wang Zhong lin and his team invented the triboelectric nanogenerator (TENG), which uses Maxwell displacement current as a driving force to directly convert mechanical stimuli into electrical signals, so it can be used as a self-driven sensor. Teng-based sensors have the advantages of simple structure and high instantaneous power density, which provides an important means for building intelligent sensor systems. At the same time, machine learning, as a technology with low cost, short development cycle, strong data processing ability and prediction ability, has a significant effect on the processing of a large number of electrical signals generated by TENG, and the combination with TENG sensors will promote the rapid development of intelligent sensor networks in the future. Therefore, this paper is based on the intelligent sound monitoring and recognition system of TENG, which has good sound recognition capability, and aims to evaluate the feasibility of the sound perception module architecture in ubiquitous sensor networks.
We consider a statistical model for symmetric matrix factorization with additive Gaussian noise in the high-dimensional regime where the rank $M$ of the signal matrix to infer scales with its size $N$ as $M = o(N^{1/10})$. Allowing for a $N$-dependent rank offers new challenges and requires new methods. Working in the Bayesian-optimal setting, we show that whenever the signal has i.i.d. entries the limiting mutual information between signal and data is given by a variational formula involving a rank-one replica symmetric potential. In other words, from the information-theoretic perspective, the case of a (slowly) growing rank is the same as when $M = 1$ (namely, the standard spiked Wigner model). The proof is primarily based on a novel multiscale cavity method allowing for growing rank along with some information-theoretic identities on worst noise for the Gaussian vector channel. We believe that the cavity method developed here will play a role in the analysis of a broader class of inference and spin models where the degrees of freedom are large arrays instead of vectors.
Most state-of-the-art machine learning techniques revolve around the optimisation of loss functions. Defining appropriate loss functions is therefore critical to successfully solving problems in this field. We present a survey of the most commonly used loss functions for a wide range of different applications, divided into classification, regression, ranking, sample generation and energy based modelling. Overall, we introduce 33 different loss functions and we organise them into an intuitive taxonomy. Each loss function is given a theoretical backing and we describe where it is best used. This survey aims to provide a reference of the most essential loss functions for both beginner and advanced machine learning practitioners.
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.