In this article, we study a curvature-like feature value of data sets in Euclidean spaces. First, we formulate such curvature functions with desirable properties under the manifold hypothesis. Then we formulate a test property, based on the law of large numbers, for the validity of a curvature function, and check it by numerical experiments for the function we construct. These experiments also suggest the conjecture that the mean of the curvature of sample manifolds coincides with the curvature of the mean manifold. Our construction is based on dimension estimation by principal component analysis and on the Gaussian curvature of hypersurfaces. Our function depends on provisional parameters $\varepsilon, \delta$, and we suggest treating the resulting curvature values as a function of these parameters to gain some robustness. As an application, we propose a method to decompose data sets into parts reflecting local structure. For this, we embed the data sets into a higher-dimensional Euclidean space using curvature values and cluster them in the embedding space. We also give some computational experiments that support the effectiveness of our methods.
Given its status as a classic problem and its importance to both theoreticians and practitioners, edit distance provides an excellent lens through which to understand how the theoretical analysis of algorithms impacts practical implementations. From an applied perspective, the goals of theoretical analysis are to predict the empirical performance of an algorithm and to serve as a yardstick to design novel algorithms that perform well in practice. In this paper, we systematically survey the types of theoretical analysis techniques that have been applied to edit distance and evaluate the extent to which each one has achieved these two goals. These techniques include traditional worst-case analysis, worst-case analysis parametrized by edit distance or entropy or compressibility, average-case analysis, semi-random models, and advice-based models. We find that the track record is mixed. On one hand, two algorithms widely used in practice have been born out of theoretical analysis and their empirical performance is captured well by theoretical predictions. On the other hand, all the algorithms developed using theoretical analysis as a yardstick since then have not had any practical relevance. We conclude by discussing the remaining open problems and how they can be tackled.
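As a concrete reference point for two of the analysis styles listed above, the following is a minimal Python sketch, not taken from the surveyed paper. `edit_distance` is the textbook quadratic dynamic program that worst-case analysis addresses; `edit_distance_at_most` is a hypothetical helper illustrating analysis parametrized by the edit distance: it evaluates only the cells within `k` of the diagonal. Both function names and the unit-cost model are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the classic O(len(a) * len(b)) dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def edit_distance_at_most(a: str, b: str, k: int):
    """Return the edit distance if it is at most k, otherwise None.

    Only cells with |i - j| <= k are evaluated. For brevity each row is
    still allocated in full; a tuned version would store just the band.
    """
    if abs(len(a) - len(b)) > k:
        return None
    inf = k + 1  # any value above k means "too far"
    prev = [j if j <= k else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [inf] * (len(b) + 1)
        if i <= k:
            curr[0] = i
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            curr[j] = min(
                prev[j] + 1,
                curr[j - 1] + 1,
                prev[j - 1] + (a[i - 1] != b[j - 1]),
                inf,
            )
        prev = curr
    return prev[-1] if prev[-1] <= k else None
```

For example, `edit_distance("kitten", "sitting")` is 3, so `edit_distance_at_most("kitten", "sitting", 2)` gives `None` while a threshold of 3 returns the distance itself.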
This work proposes a model-reduction approach for the material point method on nonlinear manifolds. Our technique approximates the $\textit{kinematics}$ by approximating the deformation map using an implicit neural representation that restricts deformation trajectories to reside on a low-dimensional manifold. By explicitly approximating the deformation map, its spatiotemporal gradients -- in particular the deformation gradient and the velocity -- can be computed via analytical differentiation. In contrast to typical model-reduction techniques that construct a linear or nonlinear manifold to approximate the (finite number of) degrees of freedom characterizing a given spatial discretization, the use of an implicit neural representation enables the proposed method to approximate the $\textit{continuous}$ deformation map. This allows the kinematic approximation to remain agnostic to the discretization. Consequently, the technique supports dynamic discretizations -- including resolution changes -- during the course of the online reduced-order-model simulation. To generate $\textit{dynamics}$ for the generalized coordinates, we propose a family of projection techniques. At each time step, these techniques: (1) Calculate full-space kinematics at quadrature points, (2) Calculate the full-space dynamics for a subset of `sample' material points, and (3) Calculate the reduced-space dynamics by projecting the updated full-space position and velocity onto the low-dimensional manifold and tangent space, respectively. We achieve significant computational speedup via hyper-reduction that ensures all three steps execute on only a small subset of the problem's spatial domain. Large-scale numerical examples with millions of material points illustrate the method's ability to gain an order of magnitude computational-cost saving -- indeed $\textit{real-time simulations}$ -- with negligible errors.
Static analysis tools have rules for several code quality issues, and these rules are manually created by experts. In this paper, we address the problem of automatic synthesis of code quality rules from examples. We formulate the rule synthesis problem as synthesizing first-order logic formulas over graph representations of code. We present a new synthesis algorithm, RhoSynth, that is based on Integer Linear Programming-based graph alignment for identifying code elements of interest to the rule. We bootstrap RhoSynth by leveraging code changes made by developers as the source of positive and negative examples. We also address rule refinement, in which the rules are incrementally improved with additional user-provided examples. We validate RhoSynth by synthesizing more than 30 Java code quality rules. These rules have been deployed as part of a code review system in a company and their precision exceeds 75% based on developer feedback collected during live code reviews. Through comparisons with recent baselines, we show that current state-of-the-art program synthesis approaches are unable to synthesize most of these rules.
We provide a decision theoretic analysis of bandit experiments. The setting corresponds to a dynamic programming problem, but solving this directly is typically infeasible. Working within the framework of diffusion asymptotics, we define suitable notions of asymptotic Bayes and minimax risk for bandit experiments. For normally distributed rewards, the minimal Bayes risk can be characterized as the solution to a nonlinear second-order partial differential equation (PDE). Using a limit of experiments approach, we show that this PDE characterization also holds asymptotically under both parametric and non-parametric distribution of the rewards. The approach further describes the state variables it is asymptotically sufficient to restrict attention to, and therefore suggests a practical strategy for dimension reduction. The upshot is that we can approximate the dynamic programming problem defining the bandit experiment with a PDE which can be efficiently solved using sparse matrix routines. We derive the optimal Bayes and minimax policies from the numerical solutions to these equations. The proposed policies substantially dominate existing methods such as Thompson sampling. The framework also allows for substantial generalizations to the bandit problem such as time discounting and pure exploration motives.
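For orientation, the Thompson sampling baseline named above can be sketched in a few lines of Python. This is only the baseline, not the PDE-based policy the abstract proposes. The sketch assumes Gaussian rewards with unit variance and an independent N(0, 1) prior on each arm's mean; the function name, the prior, and the seed are illustrative choices, not details from the paper.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run Gaussian Thompson sampling and return how often each arm was pulled.

    Assumes unit-variance rewards and a N(0, 1) prior on every arm mean, so
    after n pulls with reward sum s the posterior is N(s / (n + 1), 1 / (n + 1)).
    """
    rng = random.Random(seed)
    num_arms = len(true_means)
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for _ in range(horizon):
        # Draw one plausible mean per arm from its current posterior.
        draws = [
            rng.gauss(sums[a] / (counts[a] + 1), (1.0 / (counts[a] + 1)) ** 0.5)
            for a in range(num_arms)
        ]
        arm = max(range(num_arms), key=draws.__getitem__)
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts


pull_counts = thompson_sampling([0.0, 1.0], horizon=2000)
```

With a gap this large, the better arm should receive most of the pulls; the exact split depends on the seed.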
Multigrid is a powerful solver for large-scale linear systems arising from discretized partial differential equations. The convergence theory of multigrid methods for symmetric positive definite problems has been well developed over the past decades, while, for nonsymmetric problems, such theory is still not mature. As a foundation for multigrid analysis, two-grid convergence theory plays an important role in motivating multigrid algorithms. Regarding two-grid methods for nonsymmetric problems, most previous works focus on the spectral radius of the iteration matrix or rely on convergence measures that are typically difficult to compute in practice. Moreover, the existing results are confined to two-grid methods with exact solution of the coarse-grid system. In this paper, we analyze the convergence of a two-grid method for nonsymmetric positive definite problems (e.g., linear systems arising from the discretizations of convection-diffusion equations). In the case of an exact coarse solver, we establish an elegant identity for characterizing the two-grid convergence factor, which is measured by a smoother-induced norm. The identity can be conveniently used to derive a class of optimal restriction operators and analyze how the convergence factor is influenced by restriction. More generally, we present some convergence estimates for an inexact variant of the two-grid method, in which both linear and nonlinear coarse solvers are considered.
An important challenge in statistical analysis lies in controlling the estimation bias when handling the ever-increasing data size and model complexity. For example, approximate methods are increasingly used to address the analytical and/or computational challenges when implementing standard estimators, but they often lead to inconsistent estimators. Consequently, consistent estimators can be difficult to obtain, especially for complex models and/or in settings where the number of parameters diverges with the sample size. We propose a general simulation-based estimation framework that allows one to construct consistent and bias-corrected estimators for parameters of increasing dimensions. The key advantage of the proposed framework is that it only requires computing a simple inconsistent estimator multiple times. The resulting Just Identified iNdirect Inference estimator (JINI) enjoys nice properties, including consistency, asymptotic normality, and finite sample bias correction better than alternative methods. We further provide a simple algorithm to construct the JINI in a computationally efficient manner. Therefore, the JINI is especially useful in settings where standard methods may be challenging to apply, for example, in the presence of misclassification and rounding. We consider comprehensive simulation studies and analyze an alcohol consumption data example to illustrate the excellent performance and usefulness of the method.
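The idea of correcting a simple estimator by recomputing it on simulated data can be illustrated with a toy Python sketch. It is an assumption-laden illustration, not the paper's algorithm: the update rule is a generic fixed-point iteration that matches the observed estimate to the average simulated estimate, the model is a zero-mean Gaussian, and the initial estimator is the divide-by-n variance, which is biased in small samples (though, unlike the estimators the abstract targets, still consistent). All function names are hypothetical.

```python
import random
import statistics


def naive_variance(sample):
    """Initial estimator: divides by n, so it is biased downward for small n."""
    mean = statistics.fmean(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)


def mean_simulated_estimate(theta, n, num_sims):
    """Average of the initial estimator over samples simulated with variance theta."""
    estimates = []
    for h in range(num_sims):
        rng = random.Random(h)  # fixed seeds: the same random numbers at every call
        sd = theta ** 0.5
        sample = [rng.gauss(0.0, sd) for _ in range(n)]
        estimates.append(naive_variance(sample))
    return statistics.fmean(estimates)


def simulation_corrected(pi_hat, n, num_sims=200, num_iters=50):
    """Find theta whose average simulated estimate equals the observed estimate."""
    theta = pi_hat
    for _ in range(num_iters):
        theta += pi_hat - mean_simulated_estimate(theta, n, num_sims)
    return theta


data_rng = random.Random(12345)
observed = [data_rng.gauss(0.0, 2.0) for _ in range(10)]  # true variance is 4
pi_hat = naive_variance(observed)
theta_hat = simulation_corrected(pi_hat, n=10)
```

Reusing the same seeds in every iteration makes the map deterministic, so the iteration settles on a value whose simulated estimates average to `pi_hat`. In this toy case that roughly undoes the (n - 1)/n shrinkage of the naive estimator.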
Many forms of dependence manifest themselves over time, with behavior of variables in dynamical systems as a paradigmatic example. This paper studies temporal dependence in dynamical systems from a logical perspective, by extending a minimal modal base logic of static functional dependencies. We define a logic for dynamical systems with single time steps, provide a complete axiomatic proof calculus, and show the decidability of the satisfiability problem for a substantial fragment. The system comes in two guises: modal and first-order, that naturally complement each other. Next, we consider a timed semantics for our logic, as an intermediate between state spaces and temporal universes for the unfoldings of a dynamical system. We prove completeness and decidability by combining techniques from dynamic-epistemic logic and modal logic of functional dependencies with complex terms for objects. Also, we extend these results to the timed logic with functional symbols and term identity. Finally, we conclude with a brief outlook on how the system proposed here connects with richer temporal logics of system behavior, and with dynamic topological logic.
One of the most important problems in system identification and statistics is how to estimate the unknown parameters of a given model. Optimization methods and specialized procedures, such as Expectation-Maximization (EM), can be used when the likelihood function can be computed. For situations where one can only simulate from a parametric model, but the likelihood is difficult or impossible to evaluate, a technique known as the Two-Stage (TS) Approach can be applied to obtain reliable parametric estimates. Unfortunately, there is currently a lack of theoretical justification for TS. In this paper, we propose a statistical decision-theoretical derivation of TS, which leads to Bayesian and Minimax estimators. We also show how to apply the TS approach to models for independent and identically distributed samples, by computing quantiles of the data as a first step, and using a linear function as the second stage. The proposed method is illustrated via numerical simulations.
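The quantile-then-linear recipe described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the first stage keeps the three quartiles, the second stage is an ordinary least-squares fit on parameter values drawn from a uniform prior, and the exponential model, the prior range, and all function names are hypothetical choices.

```python
import random
import statistics


def summarize(sample):
    """Stage 1: compress a sample to its three quartiles plus a constant feature."""
    return statistics.quantiles(sample, n=4) + [1.0]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_second_stage(simulate, prior_draw, num_train, sample_size, rng):
    """Stage 2: least-squares fit of a linear map from summaries to the parameter."""
    feats, thetas = [], []
    for _ in range(num_train):
        theta = prior_draw(rng)
        feats.append(summarize(simulate(theta, sample_size, rng)))
        thetas.append(theta)
    d = len(feats[0])
    gram = [[sum(f[i] * f[j] for f in feats) for j in range(d)] for i in range(d)]
    moment = [sum(f[i] * t for f, t in zip(feats, thetas)) for i in range(d)]
    return solve(gram, moment)  # normal equations


def estimate(weights, sample):
    """Apply the fitted linear map to the quantile summary of new data."""
    return sum(w * s for w, s in zip(weights, summarize(sample)))


def simulate_exponential(theta, n, rng):
    """Toy simulator: n i.i.d. exponential draws with mean theta."""
    return [rng.expovariate(1.0 / theta) for _ in range(n)]


rng = random.Random(0)
weights = fit_second_stage(
    simulate_exponential, lambda r: r.uniform(0.5, 5.0), 500, 200, rng
)
observed = simulate_exponential(2.0, 200, rng)
theta_hat = estimate(weights, observed)
```

The likelihood is never evaluated: only the simulator is called. The exponential model is convenient here because its quantiles are proportional to the mean, so a linear second stage is a natural fit.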
Multi-fidelity models are of great importance due to their capability of fusing information coming from different simulations and sensors. In the context of Gaussian process regression, we can exploit low-fidelity models to better capture the latent manifold, thus improving the accuracy of the model. We focus on the approximation of high-dimensional scalar functions with low intrinsic dimensionality. By introducing a low-dimensional bias in a chain of Gaussian processes with different fidelities, we can fight the curse of dimensionality affecting these kinds of quantities of interest, especially for many-query applications. In particular, we seek a gradient-based reduction of the parameter space through linear active subspaces or a nonlinear transformation of the input space. Then we build a low-fidelity response surface based on such a reduction, thus enabling multi-fidelity Gaussian process regression without the need to run new simulations with simplified physical models. This has great potential in the data scarcity regime affecting many engineering applications. In this work we present a new multi-fidelity approach -- starting from the preliminary analysis conducted in Romor et al. 2020 -- involving active subspaces and the nonlinear level-set learning method. The proposed numerical method is tested on two high-dimensional benchmark functions, and on a more complex car aerodynamics problem. We show how a low intrinsic dimensionality bias can increase the accuracy of Gaussian process response surfaces.
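The gradient-based linear reduction mentioned above can be illustrated with a small Python sketch of a one-dimensional active subspace. It is a simplified illustration under assumptions, not the paper's pipeline: inputs are sampled uniformly on a cube, the gradient is supplied analytically, and the leading eigenvector of the averaged outer product of gradients is found by power iteration. The function name and the test function are hypothetical.

```python
import math
import random


def leading_active_direction(grad, dim, num_samples=200, num_iters=50, seed=0):
    """Estimate the dominant eigenvector of C = E[grad f(x) grad f(x)^T].

    Inputs are drawn uniformly from [-1, 1]^dim (an assumption of this sketch),
    and the eigenvector is found by power iteration without ever forming C.
    """
    rng = random.Random(seed)
    grads = [
        grad([rng.uniform(-1.0, 1.0) for _ in range(dim)])
        for _ in range(num_samples)
    ]
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random starting direction
    for _ in range(num_iters):
        w = [0.0] * dim
        for g in grads:
            g_dot_v = sum(gi * vi for gi, vi in zip(g, v))
            for i in range(dim):
                w[i] += g[i] * g_dot_v / num_samples  # accumulates C @ v
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to all sampled gradients")
        v = [x / norm for x in w]
    return v


# Hypothetical test function f(x) = sin(a . x): it varies only along a.
a = [1.0, 2.0] + [0.0] * 8


def grad_f(x):
    phase = sum(ai * xi for ai, xi in zip(a, x))
    return [math.cos(phase) * ai for ai in a]


direction = leading_active_direction(grad_f, dim=len(a))
```

Here every gradient is parallel to `a`, so the recovered direction equals `a` normalized, up to sign. A response surface would then be fitted on the scalar projection of each input onto `direction` instead of on all ten inputs.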
Models for dependent data are distinguished by their targets of inference. Marginal models are useful when interest lies in quantifying associations averaged across a population of clusters. When the functional form of a covariate-outcome association is unknown, flexible regression methods are needed to allow for potentially non-linear relationships. We propose a novel marginal additive model (MAM) for modelling cluster-correlated data with non-linear population-averaged associations. The proposed MAM is a unified framework for estimation and uncertainty quantification of a marginal mean model, combined with inference for between-cluster variability and cluster-specific prediction. We propose a fitting algorithm that enables efficient computation of standard errors and corrects for estimation of penalty terms. We demonstrate the proposed methods in simulations and in application to (i) a longitudinal study of beaver foraging behaviour, and (ii) a spatial analysis of Loa loa infection in West Africa. R code for implementing the proposed methodology is available at //github.com/awstringer1/mam.