We present a new class of Langevin based algorithms, which overcomes many of the known shortcomings of popular adaptive optimizers that are currently used for the fine tuning of deep learning models. Its underpinning theory relies on recent advances of Euler's polygonal approximations for stochastic differential equations (SDEs) with monotone coefficients. As a result, it inherits the stability properties of tamed algorithms, while it addresses other known issues, e.g. vanishing gradients in neural networks. In particular, we provide a nonasymptotic analysis and full theoretical guarantees for the convergence properties of an algorithm of this novel class, which we named TH$\varepsilon$O POULA (or, simply, TheoPouLa). Finally, several experiments are presented with different types of deep learning models, which show the superior performance of TheoPouLa over many popular adaptive optimization algorithms.
This work proposes a novel framework for analyzing disease progression using time-aware neural ordinary differential equations (NODE). We introduce a "time-aware head" in a framework trained through self-supervised learning (SSL) to leverage temporal information in latent space for data augmentation. This approach effectively integrates NODEs with SSL, offering significant performance improvements compared to traditional methods that lack explicit temporal integration. We demonstrate the effectiveness of our strategy for diabetic retinopathy progression prediction using the OPHDIAT database. Compared to the baseline, all NODE architectures achieve statistically significant improvements in area under the ROC curve (AUC) and Kappa metrics, highlighting the efficacy of pre-training with SSL-inspired approaches. Additionally, our framework promotes stable training for NODEs, a commonly encountered challenge in time-aware modeling.
We exploit the similarities between Tikhonov regularization and Bayesian hierarchical models to propose a regularization scheme that acts like a distributed Tikhonov regularization where the amount of regularization varies from component to component. In the standard formulation, Tikhonov regularization compensates for the inherent ill-conditioning of linear inverse problems by augmenting the data fidelity term measuring the mismatch between the data and the model output with a scaled penalty functional. The selection of the scaling is the core problem in Tikhonov regularization. If an estimate of the amount of noise in the data is available, a popular way is to use the Morozov discrepancy principle, stating that the scaling parameter should be chosen so as to guarantee that the norm of the data fitting error is approximately equal to the norm of the noise in the data. A too small value of the regularization parameter would yield a solution that fits to the noise while a too large value would lead to an excessive penalization of the solution. In many applications, it would be preferable to apply distributed regularization, replacing the regularization scalar by a vector valued parameter, allowing different regularization for different components of the unknown, or for groups of them. A distributed Tikhonov-inspired regularization is particularly well suited when the data have significantly different sensitivity to different components, or to promote sparsity of the solution. The numerical scheme that we propose, while exploiting the Bayesian interpretation of the inverse problem and identifying the Tikhonov regularization with the Maximum A Posteriori (MAP) estimation, requires no statistical tools. A combination of numerical linear algebra and optimization tools makes the scheme computationally efficient and suitable for problems where the matrix is not explicitly available.
Solving partial differential equations (PDEs) on fine spatio-temporal scales for high-fidelity solutions is critical for numerous scientific breakthroughs. Yet, this process can be prohibitively expensive, owing to the inherent complexities of the problems, including nonlinearity and multiscale phenomena. To speed up large-scale computations, a process known as downscaling is employed, which generates high-fidelity approximate solutions from their low-fidelity counterparts. In this paper, we propose a novel Physics-Guided Diffusion Model (PGDM) for downscaling. Our model, initially trained on a dataset comprising low-and-high-fidelity paired solutions across coarse and fine scales, generates new high-fidelity approximations from any new low-fidelity inputs. These outputs are subsequently refined through fine-tuning, aimed at minimizing the physical discrepancies as defined by the discretized PDEs at the finer scale. We evaluate and benchmark our model's performance against other downscaling baselines in three categories of nonlinear PDEs. Our numerical experiments demonstrate that our model not only outperforms the baselines but also achieves a computational acceleration exceeding tenfold, while maintaining the same level of accuracy as the conventional fine-scale solvers.
In this work we consider the Allen--Cahn equation, a prototypical model problem in nonlinear dynamics that exhibits bifurcations corresponding to variations of a deterministic bifurcation parameter. Going beyond the state-of-the-art, we introduce a random coefficient function in the linear reaction part of the equation, thereby accounting for random, spatially-heterogeneous effects. Importantly, we assume a spatially constant, deterministic mean value of the random coefficient. We show that this mean value is in fact a bifurcation parameter in the Allen--Cahn equation with random coefficients. Moreover, we show that the bifurcation points and bifurcation curves become random objects. We consider two distinct modelling situations: (i) for a spatially homogeneous coefficient we derive analytical expressions for the distribution of the bifurcation points and show that the bifurcation curves are random shifts of a fixed reference curve; (ii) for a spatially heterogeneous coefficient we employ a generalized polynomial chaos expansion to approximate the statistical properties of the random bifurcation points and bifurcation curves. We present numerical examples in 1D physical space, where we combine the popular software package Continuation Core and Toolboxes (CoCo) for numerical continuation and the Sparse Grids Matlab Kit for the polynomial chaos expansion. Our exposition addresses both, dynamical systems and uncertainty quantification, highlighting how analytical and numerical tools from both areas can be combined efficiently for the challenging uncertainty quantification analysis of bifurcations in random differential equations.
Hybrid quantum-classical computing in the noisy intermediate-scale quantum (NISQ) era with variational algorithms can exhibit barren plateau issues, causing difficult convergence of gradient-based optimization techniques. In this paper, we discuss "post-variational strategies", which shift tunable parameters from the quantum computer to the classical computer, opting for ensemble strategies when optimizing quantum models. We discuss various strategies and design principles for constructing individual quantum circuits, where the resulting ensembles can be optimized with convex programming. Further, we discuss architectural designs of post-variational quantum neural networks and analyze the propagation of estimation errors throughout such neural networks. Finally, we show that empirically, post-variational quantum neural networks using our architectural designs can potentially provide better results than variational algorithms and performance comparable to that of two-layer neural networks.
Learning approximations to smooth target functions of many variables from finite sets of pointwise samples is an important task in scientific computing and its many applications in computational science and engineering. Despite well over half a century of research on high-dimensional approximation, this remains a challenging problem. Yet, significant advances have been made in the last decade towards efficient methods for doing this, commencing with so-called sparse polynomial approximation methods and continuing most recently with methods based on Deep Neural Networks (DNNs). In tandem, there have been substantial advances in the relevant approximation theory and analysis of these techniques. In this work, we survey this recent progress. We describe the contemporary motivations for this problem, which stem from parametric models and computational uncertainty quantification; the relevant function classes, namely, classes of infinite-dimensional, Banach-valued, holomorphic functions; fundamental limits of learnability from finite data for these classes; and finally, sparse polynomial and DNN methods for efficiently learning such functions from finite data. For the latter, there is currently a significant gap between the approximation theory of DNNs and the practical performance of deep learning. Aiming to narrow this gap, we develop the topic of practical existence theory, which asserts the existence of dimension-independent DNN architectures and training strategies that achieve provably near-optimal generalization errors in terms of the amount of training data.
Bilevel optimization, with broad applications in machine learning, has an intricate hierarchical structure. Gradient-based methods have emerged as a common approach to large-scale bilevel problems. However, the computation of the hyper-gradient, which involves a Hessian inverse vector product, confines the efficiency and is regarded as a bottleneck. To circumvent the inverse, we construct a sequence of low-dimensional approximate Krylov subspaces with the aid of the Lanczos process. As a result, the constructed subspace is able to dynamically and incrementally approximate the Hessian inverse vector product with less effort and thus leads to a favorable estimate of the hyper-gradient. Moreover, we propose a~provable subspace-based framework for bilevel problems where one central step is to solve a small-size tridiagonal linear system. To the best of our knowledge, this is the first time that subspace techniques are incorporated into bilevel optimization. This successful trial not only enjoys $\mathcal{O}(\epsilon^{-1})$ convergence rate but also demonstrates efficiency in a synthetic problem and two deep learning tasks.
We propose a method for obtaining parsimonious decompositions of networks into higher order interactions which can take the form of arbitrary motifs.The method is based on a class of analytically solvable generative models, where vertices are connected via explicit copies of motifs, which in combination with non-parametric priors allow us to infer higher order interactions from dyadic graph data without any prior knowledge on the types or frequencies of such interactions. Crucially, we also consider 'degree--corrected' models that correctly reflect the degree distribution of the network and consequently prove to be a better fit for many real world--networks compared to non-degree corrected models. We test the presented approach on simulated data for which we recover the set of underlying higher order interactions to a high degree of accuracy. For empirical networks the method identifies concise sets of atomic subgraphs from within thousands of candidates that cover a large fraction of edges and include higher order interactions of known structural and functional significance. The method not only produces an explicit higher order representation of the network but also a fit of the network to analytically tractable models opening new avenues for the systematic study of higher order network structures.
We apply a generalized piecewise-linear (PL) version of Morse theory due to Grunert-Kuhnel-Rote to define and study new local and global notions of topological complexity for fully-connected feedforward ReLU neural network functions, F: R^n -> R. Along the way, we show how to construct, for each such F, a canonical polytopal complex K(F) and a deformation retract of the domain onto K(F), yielding a convenient compact model for performing calculations. We also give a construction showing that local complexity can be arbitrarily high.
Conventional machine learning algorithms have traditionally been designed under the assumption that input data follows a vector-based format, with an emphasis on vector-centric paradigms. However, as the demand for tasks involving set-based inputs has grown, there has been a paradigm shift in the research community towards addressing these challenges. In recent years, the emergence of neural network architectures such as Deep Sets and Transformers has presented a significant advancement in the treatment of set-based data. These architectures are specifically engineered to naturally accommodate sets as input, enabling more effective representation and processing of set structures. Consequently, there has been a surge of research endeavors dedicated to exploring and harnessing the capabilities of these architectures for various tasks involving the approximation of set functions. This comprehensive survey aims to provide an overview of the diverse problem settings and ongoing research efforts pertaining to neural networks that approximate set functions. By delving into the intricacies of these approaches and elucidating the associated challenges, the survey aims to equip readers with a comprehensive understanding of the field. Through this comprehensive perspective, we hope that researchers can gain valuable insights into the potential applications, inherent limitations, and future directions of set-based neural networks. Indeed, from this survey we gain two insights: i) Deep Sets and its variants can be generalized by differences in the aggregation function, and ii) the behavior of Deep Sets is sensitive to the choice of the aggregation function. From these observations, we show that Deep Sets, one of the well-known permutation-invariant neural networks, can be generalized in the sense of a quasi-arithmetic mean.