Non-linear dynamical systems can be handily described by the associated Koopman operator, whose action evolves every observable of the system forward in time. Learning the Koopman operator from data is enabled by a number of algorithms. In this work we present nonasymptotic learning bounds for the Koopman eigenvalues and eigenfunctions estimated by two popular algorithms: Extended Dynamic Mode Decomposition (EDMD) and Reduced Rank Regression (RRR). We focus on time-reversal-invariant Markov chains, implying that the Koopman operator is self-adjoint. This includes important examples of stochastic dynamical systems, notably Langevin dynamics. Our spectral learning bounds are driven by the simultaneous control of the operator norm risk of the estimators and a metric distortion associated to the corresponding eigenfunctions. Our analysis indicates that both algorithms have similar variance, but EDMD suffers from a larger bias which might be detrimental to its learning rate. We further argue that a large metric distortion may lead to spurious eigenvalues, a phenomenon which has been empirically observed, and note that metric distortion can be estimated from data. Numerical experiments complement the theoretical findings.
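The link between time-reversal invariance and self-adjointness can be checked on a small finite-state example. The sketch below is only an illustration under stated assumptions: a random walk on a weighted undirected graph stands in for the reversible Markov chain, its transition matrix stands in for the Koopman operator, and all names and sizes are hypothetical rather than taken from the paper.

```python
import numpy as np

def reversible_chain(n=5, seed=0):
    """Random walk on a weighted undirected graph (an assumed toy chain)."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(0.1, 1.0, size=(n, n))
    W = A + A.T                     # symmetric positive edge weights
    d = W.sum(axis=1)
    P = W / d[:, None]              # row-stochastic transition matrix
    pi = d / d.sum()                # stationary distribution
    return P, pi

def symmetrized(P, pi):
    """Return D^{1/2} P D^{-1/2} with D = diag(pi)."""
    s = np.sqrt(pi)
    return (s[:, None] * P) / s[None, :]

P, pi = reversible_chain()
S = symmetrized(P, pi)

# Detailed balance (pi_i P_ij == pi_j P_ji) makes S symmetric, so P is
# self-adjoint in the pi-weighted inner product and its spectrum is real.
eigs = np.linalg.eigvalsh(S)
print(eigs)
```

Because `S` is symmetric, a symmetric eigensolver can be used, and the eigenvalues of `P` coincide with those of `S`.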
Neural operators are gaining attention in computational science and engineering. PCA-Net is a recently proposed neural operator architecture which combines principal component analysis (PCA) with neural networks to approximate an underlying operator. The present work develops approximation theory for this approach, improving and significantly extending previous work in this direction. In terms of qualitative bounds, this paper derives a novel universal approximation result, under minimal assumptions on the underlying operator and the data-generating distribution. In terms of quantitative bounds, two potential obstacles to efficient operator learning with PCA-Net are identified, and made rigorous through the derivation of lower complexity bounds; the first relates to the complexity of the output distribution, measured by a slow decay of the PCA eigenvalues. The other obstacle relates to the inherent complexity of the space of operators between infinite-dimensional input and output spaces, resulting in a rigorous and quantifiable statement of the curse of dimensionality. In addition to these lower bounds, upper complexity bounds are derived; first, a suitable smoothness criterion is shown to ensure an algebraic decay of the PCA eigenvalues. Then, it is shown that PCA-Net can overcome the general curse of dimensionality for specific operators of interest, arising from the Darcy flow and Navier-Stokes equations.
Machine learned partial differential equation (PDE) solvers trade the reliability of standard numerical methods for potential gains in accuracy and/or speed. The only way for a solver to guarantee that it outputs the exact solution is to use a convergent method in the limit that the grid spacing $\Delta x$ and timestep $\Delta t$ approach zero. Machine learned solvers, which learn to update the solution at large $\Delta x$ and/or $\Delta t$, can never guarantee perfect accuracy. Some amount of error is inevitable, so the question becomes: how do we constrain machine learned solvers to give us the sorts of errors that we are willing to tolerate? In this paper, we design more reliable machine learned PDE solvers by preserving discrete analogues of the continuous invariants of the underlying PDE. Examples of such invariants include conservation of mass, conservation of energy, the second law of thermodynamics, and/or non-negative density. Our key insight is simple: to preserve invariants, at each timestep apply an error-correcting algorithm to the update rule. Though this strategy is different from how standard solvers preserve invariants, it is necessary to retain the flexibility that allows machine learned solvers to be accurate at large $\Delta x$ and/or $\Delta t$. This strategy can be applied to any autoregressive solver for any time-dependent PDE in arbitrary geometries with arbitrary boundary conditions. Although this strategy is very general, the specific error-correcting algorithms need to be tailored to the invariants of the underlying equations as well as to the solution representation and time-stepping scheme of the solver. The error-correcting algorithms we introduce have two key properties. First, by preserving the right invariants they guarantee numerical stability. Second, in closed or periodic systems they do so without degrading the accuracy of an already-accurate solver.
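The idea of correcting the update at each timestep can be sketched for the simplest invariant, conservation of mass on a uniform periodic grid. This is a minimal illustration under stated assumptions, not the error-correcting algorithms of the paper: `proposed_update` is a hypothetical stand-in for a learned solver, and the correction simply removes the mean of the update, which is the smallest change (in the Euclidean norm) that keeps the discrete total mass fixed.

```python
import numpy as np

def proposed_update(u):
    """Hypothetical stand-in for a learned solver's update (not conservative)."""
    return 0.1 * (np.roll(u, 1) - u) + 0.01 * np.sin(u)

def conserve_mass(du):
    """Project the update onto sum(du) == 0 so that sum(u + du) == sum(u)."""
    return du - du.mean()

def corrected_step(u):
    """One autoregressive step with the mass correction applied to the update."""
    return u + conserve_mass(proposed_update(u))

# Assumed initial condition on a periodic grid with 64 cells.
u0 = 1.0 + 0.5 * np.cos(np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False))
u = u0.copy()
for _ in range(100):
    u = corrected_step(u)

print(u0.sum(), u.sum())  # total mass before and after 100 corrected steps
```

Other invariants, such as energy or non-negative density, would need different corrections tailored to the equation and the solution representation.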
In this work, we give sufficient conditions for the almost global asymptotic stability of a cascade in which the inner loop and the unforced outer loop are each almost globally asymptotically stable. Our qualitative approach relies on the absence of chain recurrence for non-equilibrium points of the unforced outer loop, the hyperbolicity of equilibria, and the precompactness of forward trajectories. We show that the required structure of the chain recurrent set can be readily verified, and describe two important classes of systems with this property. We also show that the precompactness requirement can be verified by growth rate conditions on the interconnection term coupling the subsystems. Our results stand in contrast to prior works that require either global asymptotic stability of the subsystems (impossible for smooth systems evolving on general manifolds), time scale separation between the subsystems, or strong disturbance robustness properties of the outer loop. The approach has clear applications in stability certification of cascaded controllers for systems evolving on manifolds.
Bilevel optimization has been applied to a wide variety of machine learning models, and numerous stochastic bilevel optimization algorithms have been developed in recent years. However, most existing algorithms focus only on the single-machine setting and therefore cannot handle distributed data. To address this issue, under the setting where all participants compose a network and perform peer-to-peer communication in this network, we developed two novel decentralized stochastic bilevel optimization algorithms based on the gradient tracking communication mechanism and two different gradient estimators. Additionally, we established their convergence rates for nonconvex-strongly-convex problems with novel theoretical analysis strategies. To our knowledge, this is the first work achieving these theoretical results. Finally, we applied our algorithms to practical machine learning models, and the experimental results confirmed the efficacy of our algorithms.
The Koopman operator allows a nonlinear system to be rewritten as an infinite-dimensional linear system by viewing it in terms of an infinite set of lifting functions instead of a state vector. The main feature of this representation is its linearity, making it compatible with existing linear systems theory. A finite-dimensional approximation of the Koopman operator can be identified from experimental data by choosing a finite subset of lifting functions, applying it to the data, and solving a least squares problem in the lifted space. Existing Koopman operator approximation methods are designed to identify open-loop systems. However, it is impractical or impossible to run experiments on some systems without a feedback controller. Unfortunately, the introduction of feedback control results in correlations between the system's input and output, making some plant dynamics difficult to identify if the controller is neglected. This paper addresses this limitation by introducing a method to identify a Koopman model of the closed-loop system, and then extract a Koopman model of the plant given knowledge of the controller. This is accomplished by leveraging the linearity of the Koopman representation of the system. The proposed approach widens the applicability of Koopman operator identification methods to a broader class of systems. The effectiveness of the proposed closed-loop Koopman operator approximation method is demonstrated experimentally using a Harmonic Drive gearbox exhibiting nonlinear vibrations.
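The basic identification step described above (choose lifting functions, lift the data, solve a least squares problem) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses an autonomous toy system with no input and no controller, so it does not implement the closed-loop method of the paper. The system, the lifting functions, and all names are hypothetical, chosen so that the lifted dynamics are exactly linear.

```python
import numpy as np

LAM, MU = 0.9, 0.5  # assumed parameters of the toy system

def step(x):
    """Toy nonlinear discrete-time system (an illustrative assumption)."""
    x1, x2 = x
    return np.array([LAM * x1, MU * x2 + (LAM**2 - MU) * x1**2])

def lift(x):
    """Finite set of lifting functions: [x1, x2, x1^2]."""
    x1, x2 = x
    return np.array([x1, x2, x1**2])

# Collect snapshot pairs (x_k, x_{k+1}) from random initial states.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
Y = np.array([step(x) for x in X])

# Lift both snapshot sets; each row is one lifted state.
Psi_X = np.array([lift(x) for x in X])
Psi_Y = np.array([lift(y) for y in Y])

# Least squares in the lifted space: Psi_X @ K.T ~= Psi_Y.
K_T, *_ = np.linalg.lstsq(Psi_X, Psi_Y, rcond=None)
K = K_T.T

print(np.round(K, 6))
print(np.sort(np.linalg.eigvals(K).real))
```

For this particular choice of lifting functions the lifted state evolves exactly linearly, so the least squares solution recovers the true matrix; for general systems a finite set of lifting functions only gives an approximation.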
We study the convergence of a family of numerical integration methods where the numerical integral is formulated as a finite matrix approximation to a multiplication operator. For bounded functions, the convergence has already been established using the theory of strong operator convergence. In this article, we consider unbounded functions and domains, which pose several difficulties compared to the bounded case. A natural choice of method for this study is the theory of strong resolvent convergence, which has previously been applied mostly to study the convergence of approximations of differential operators. The existing theory already includes convergence theorems that can be used directly as proofs for a limited class of functions and extended to a wider class of functions in terms of function growth or discontinuity. The extended results apply to all self-adjoint operators, not just multiplication operators. We also show how Jensen's operator inequality can be used to analyse the convergence of an improper numerical integral of a function bounded by an operator convex function.
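One well-known instance of a numerical integral written as a finite matrix approximation of a multiplication operator is Gauss-Legendre quadrature computed from a truncated Jacobi matrix (the Golub-Welsch construction). The sketch below is an illustration under stated assumptions: it uses the orthonormal Legendre basis on $[-1, 1]$ and a bounded integrand, so it shows the bounded setting rather than the unbounded case studied here, and the function names are hypothetical.

```python
import numpy as np

def jacobi_matrix(n):
    """n x n truncation of multiplication by x in the orthonormal Legendre basis."""
    k = np.arange(1, n)
    beta = k / np.sqrt(4.0 * k**2 - 1.0)   # off-diagonal recurrence coefficients
    return np.diag(beta, 1) + np.diag(beta, -1)

def integrate(f, n):
    """Approximate the integral of f over [-1, 1] as 2 * <e1, f(M_n) e1>."""
    nodes, vecs = np.linalg.eigh(jacobi_matrix(n))
    # f(M_n) = V f(Lambda) V^T, so <e1, f(M_n) e1> = sum_j V[0, j]^2 f(lambda_j).
    weights = 2.0 * vecs[0, :] ** 2
    return float(np.sum(weights * f(nodes)))

print(integrate(lambda x: x**2, 5))  # compare with the exact value 2/3
print(integrate(np.exp, 5))          # compare with e - 1/e
```

The eigenvalues of the truncated matrix play the role of quadrature nodes, and the squared first components of the eigenvectors give the weights.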
Estimands can help clarify the interpretation of treatment effects and ensure that estimators are aligned to the study's objectives. Cluster randomised trials require additional attributes to be defined within the estimand compared to individually randomised trials, including whether treatment effects are marginal or cluster specific, and whether they are participant or cluster average. In this paper, we provide formal definitions of estimands encompassing both these attributes using potential outcomes notation and describe differences between them. We then provide an overview of estimators for each estimand that are asymptotically unbiased under minimal assumptions. Then, through a reanalysis of a published cluster randomised trial, we demonstrate that estimates corresponding to the different estimands can vary considerably. Estimated odds ratios corresponding to different estimands varied by more than 30 percent, from 3.69 to 4.85. We conclude that careful specification of the estimand, along with appropriate choice of estimator, is essential to ensuring that cluster randomised trials are addressing the right question.
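The distinction between participant-average and cluster-average effects can be made concrete with a small numerical sketch. The data below are invented purely for illustration and are not from the reanalysed trial; the effect measure is a simple difference in means rather than an odds ratio, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def participant_average(clusters):
    """Mean outcome giving every participant equal weight."""
    pooled = [y for cluster in clusters for y in cluster]
    return mean(pooled)

def cluster_average(clusters):
    """Mean outcome giving every cluster equal weight, regardless of size."""
    return mean([mean(cluster) for cluster in clusters])

# Made-up binary outcomes; each inner list is one cluster.
treated = [[1, 1, 1, 1, 1, 1, 1, 0], [1, 0]]
control = [[1, 0, 0, 0, 0, 0, 0, 0], [1, 0]]

pa_effect = participant_average(treated) - participant_average(control)
ca_effect = cluster_average(treated) - cluster_average(control)

print(pa_effect, ca_effect)
```

In this toy data the treatment effect is larger in the bigger clusters, so the participant-average difference (about 0.6) exceeds the cluster-average difference (0.375). The two summaries agree when cluster size is unrelated to the effect.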
Classic algorithms and machine learning systems like neural networks are both abundant in everyday life. While classic computer science algorithms are suitable for precise execution of exactly defined tasks such as finding the shortest path in a large graph, neural networks allow learning from data to predict the most likely answer in more complex tasks such as image classification, which cannot be reduced to an exact algorithm. To get the best of both worlds, this thesis explores combining both concepts leading to more robust, better performing, more interpretable, more computationally efficient, and more data efficient architectures. The thesis formalizes the idea of algorithmic supervision, which allows a neural network to learn from or in conjunction with an algorithm. When integrating an algorithm into a neural architecture, it is important that the algorithm is differentiable such that the architecture can be trained end-to-end and gradients can be propagated back through the algorithm in a meaningful way. To make algorithms differentiable, this thesis proposes a general method for continuously relaxing algorithms by perturbing variables and approximating the expectation value in closed form, i.e., without sampling. In addition, this thesis proposes differentiable algorithms, such as differentiable sorting networks, differentiable renderers, and differentiable logic gate networks. Finally, this thesis presents alternative training strategies for learning with algorithms.
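A small example may help to show what relaxing an algorithm by perturbation with a closed-form expectation can look like. The sketch below is an illustration under stated assumptions rather than the general method of the thesis: the perturbation is assumed to follow a logistic distribution with scale `beta`, for which the expected value of a hard step function is a sigmoid, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def relaxed_step(x, beta=1.0):
    """E[H(x + eps)] for eps ~ Logistic(0, beta), computed without sampling."""
    return sigmoid(x / beta)

def relaxed_min(a, b, beta=1.0):
    """Expected result of 'a if a < b else b' when the comparison is perturbed."""
    p = relaxed_step(b - a, beta)   # relaxed probability that a is the smaller
    return p * a + (1.0 - p) * b

print(relaxed_min(1.0, 3.0, beta=1.0))   # smooth blend of the two inputs
print(relaxed_min(1.0, 3.0, beta=0.01))  # approaches the hard minimum
```

The scale `beta` trades off faithfulness to the hard algorithm against smoothness of the gradients: small values approach the discrete comparison, larger values give a smoother function.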
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good empirical generalization of overparameterized models. Overparameterized models are excessively complex with respect to the size of the training dataset, which results in them perfectly fitting (i.e., interpolating) the training data, which is usually noisy. Such interpolation of noisy data is traditionally associated with detrimental overfitting, and yet a wide range of interpolating models -- from simple linear models to deep neural networks -- have recently been observed to generalize extremely well on fresh test data. Indeed, the recently discovered double descent phenomenon has revealed that highly overparameterized models often improve over the best underparameterized model in test performance. Understanding learning in this overparameterized regime requires new theory and foundational empirical studies, even for the simplest case of the linear model. The underpinnings of this understanding have been laid in very recent analyses of overparameterized linear regression and related statistical learning tasks, which resulted in precise analytic characterizations of double descent. This paper provides a succinct overview of this emerging theory of overparameterized ML (henceforth abbreviated as TOPML) that explains these recent findings through a statistical signal processing perspective. We emphasize the unique aspects that define the TOPML research area as a subfield of modern ML theory and outline interesting open questions that remain.
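For the linear model, interpolation in the overparameterized regime is commonly studied through the minimum-norm solution. The sketch below is a minimal illustration under stated assumptions: the data are random, the dimensions are arbitrary, and the variable names are hypothetical. It shows only that many interpolating solutions exist and that the pseudoinverse selects the one of smallest Euclidean norm; it does not reproduce a double descent curve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                      # fewer samples than parameters
X = rng.normal(size=(n, p))
y = rng.normal(size=n)            # pure-noise targets, still fitted exactly

# Minimum-norm interpolating solution via the Moore-Penrose pseudoinverse.
w_min = np.linalg.pinv(X) @ y

# Another interpolator: add a component from the null space of X.
r = rng.normal(size=p)
null_part = r - np.linalg.pinv(X) @ (X @ r)
w_other = w_min + null_part

print(np.linalg.norm(X @ w_min - y), np.linalg.norm(X @ w_other - y))
print(np.linalg.norm(w_min), np.linalg.norm(w_other))
```

Both weight vectors fit the training data to numerical precision, but they differ in norm and, in general, in their predictions on new inputs.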
The area of Data Analytics on graphs promises a paradigm shift as we approach information processing of classes of data, which are typically acquired on irregular but structured domains (social networks, various ad-hoc sensor networks). Yet, despite its long history, current approaches mostly focus on the optimization of graphs themselves, rather than on directly inferring learning strategies, such as detection, estimation, statistical and probabilistic inference, clustering and separation from signals and data acquired on graphs. To fill this void, we first revisit graph topologies from a Data Analytics point of view, and establish a taxonomy of graph networks through a linear algebraic formalism of graph topology (vertices, connections, directivity). This serves as a basis for spectral analysis of graphs, whereby the eigenvalues and eigenvectors of graph Laplacian and adjacency matrices are shown to convey physical meaning related to both graph topology and higher-order graph properties, such as cuts, walks, paths, and neighborhoods. Next, to illustrate estimation strategies performed on graph signals, spectral analysis of graphs is introduced through eigenanalysis of mathematical descriptors of graphs and in a generic way. Finally, a framework for vertex clustering and graph segmentation is established based on graph spectral representation (eigenanalysis) which illustrates the power of graphs in various data association tasks. The supporting examples demonstrate the promise of Graph Data Analytics in modeling structural and functional/semantic inferences. At the same time, Part I serves as a basis for Part II and Part III which deal with theory, methods and applications of processing Data on Graphs and Graph Topology Learning from data.
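The role of the graph Laplacian spectrum in vertex clustering can be illustrated on a very small graph. The sketch below is an illustration under stated assumptions: the graph (two triangles joined by a single edge) is invented for this example, and the two-way split by the sign of the eigenvector of the second-smallest Laplacian eigenvalue (the Fiedler vector) is one standard spectral clustering heuristic, not necessarily the exact procedure developed in this work.

```python
import numpy as np

# Hypothetical undirected graph: triangles {0,1,2} and {3,4,5}, bridged by edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0          # symmetric adjacency matrix

L = np.diag(A.sum(axis=1)) - A       # combinatorial Laplacian L = D - A

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]              # eigenvector of the second-smallest eigenvalue

labels = (fiedler > 0).astype(int)   # two clusters by sign
print(eigvals)
print(labels)
```

The smallest eigenvalue is zero because the graph is connected, and the sign pattern of the Fiedler vector separates the two triangles. Which triangle receives label 0 or 1 depends on the arbitrary sign of the eigenvector.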