Ordinary differential equation models are used to describe dynamic processes across biology. To perform likelihood-based parameter inference on these models, it is necessary to specify a statistical process representing the contribution of factors not explicitly included in the mathematical model. For this, independent Gaussian noise is commonly chosen, with its use so widespread that researchers typically provide no explicit justification for the choice. This noise model assumes that `random' latent factors affect the system in an ephemeral fashion, resulting in unsystematic deviations of observables from their modelled counterparts. However, like the deterministically modelled parts of a system, these latent factors can have persistent effects on observables. Here, we use experimental data from dynamical systems drawn from cardiac physiology and electrochemistry to demonstrate that highly persistent differences between observations and modelled quantities can occur. Considering the case when persistent noise arises solely from measurement imperfections, we use the Fisher information matrix to quantify how uncertainty in parameter estimates is artificially reduced when independent noise is erroneously assumed. We present a workflow for diagnosing persistent noise from model fits and describe how to remodel the noise process to account for correlated errors.
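As a minimal illustration of the Fisher information argument (not the paper's cardiac or electrochemical models), the sketch below compares the asymptotic standard error of a decay rate in a toy exponential-decay model when the observation noise is treated as i.i.d. Gaussian versus AR(1)-correlated; the model, noise scale, and correlation level are all assumed purely for illustration.

```python
import numpy as np

# Toy model y(t) = exp(-k t) observed with Gaussian noise at 50 time points.
t = np.linspace(0.0, 10.0, 50)
k, sigma, rho = 0.5, 0.1, 0.9          # assumed decay rate, noise scale, correlation

# Sensitivity of the model output with respect to k: d/dk exp(-k t) = -t exp(-k t)
J = (-t * np.exp(-k * t)).reshape(-1, 1)

# i.i.d. Gaussian covariance vs AR(1) covariance sigma^2 * rho^|i-j|
idx = np.arange(len(t))
Sigma_iid = sigma**2 * np.eye(len(t))
Sigma_ar1 = sigma**2 * rho ** np.abs(np.subtract.outer(idx, idx))

def fisher_std(Sigma):
    # Gaussian Fisher information with known covariance: I(k) = J^T Sigma^{-1} J
    I = J.T @ np.linalg.solve(Sigma, J)
    return 1.0 / np.sqrt(I[0, 0])       # asymptotic standard error of k

print("std err of k, assuming i.i.d. noise:", fisher_std(Sigma_iid))
print("std err of k, under AR(1) noise    :", fisher_std(Sigma_ar1))
```

With positively correlated noise and a smooth sensitivity, the i.i.d. assumption typically reports the smaller, over-optimistic standard error, which is the artificial reduction in uncertainty referred to above.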
The US Census Bureau will deliberately corrupt data sets derived from the 2020 US Census in an effort to maintain privacy, suggesting a painful trade-off between the privacy of respondents and the precision of economic analysis. To investigate whether this trade-off is inevitable, we formulate a semiparametric model of causal inference with high-dimensional corrupted data. We propose a procedure for data cleaning, estimation, and inference, with data cleaning-adjusted confidence intervals. We prove consistency, Gaussian approximation, and semiparametric efficiency via finite-sample arguments, with a rate of $n^{-1/2}$ for semiparametric estimands that degrades gracefully for nonparametric estimands. Our key assumption is that the true covariates are approximately low rank, which we interpret as approximate repeated measurements and validate in the Census. In our analysis, we provide nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics. Calibrated simulations verify the coverage of our data cleaning-adjusted confidence intervals and demonstrate the relevance of our results for 2020 Census data.
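A minimal sketch of the data-cleaning step only, assuming approximately low-rank covariates observed with noise and missingness; the cleaning here is a plain inverse-propensity-scaled truncated SVD, and it does not reproduce the authors' full procedure or their cleaning-adjusted confidence intervals. All sizes and noise levels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 500, 40, 3                      # samples, covariates, assumed true rank

U, V = rng.normal(size=(n, r)), rng.normal(size=(r, p))
X_true = U @ V                            # approximately low-rank covariates
X_obs = X_true + rng.normal(scale=0.5, size=(n, p))   # measurement corruption
mask = rng.random((n, p)) < 0.8           # entries actually observed
X_obs = np.where(mask, X_obs, 0.0)        # missing entries filled with zeros

# Rescale by the observation frequency, then hard-threshold the spectrum at rank r.
p_hat = mask.mean()
U_s, s, Vt = np.linalg.svd(X_obs / p_hat, full_matrices=False)
X_clean = (U_s[:, :r] * s[:r]) @ Vt[:r, :]

print("relative error before cleaning:",
      np.linalg.norm(X_obs / p_hat - X_true) / np.linalg.norm(X_true))
print("relative error after cleaning :",
      np.linalg.norm(X_clean - X_true) / np.linalg.norm(X_true))
```

The cleaned covariates would then feed into a downstream causal estimator, with the confidence intervals adjusted for the cleaning step as described in the abstract.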
We present a robust framework for performing linear regression with missing entries in the features. By considering an elliptical data distribution, and specifically a multivariate normal model, we are able to formulate a conditional distribution for the missing entries and present a robust framework which minimizes the worst-case error caused by uncertainty about the missing data. We show that the proposed formulation, which naturally takes into account the dependency between different variables, ultimately reduces to a convex program, for which a customized and scalable solver can be developed. In addition to a detailed analysis used to derive this solver, we analyze the asymptotic behavior of the proposed framework and present technical discussions on estimating the required input parameters. We complement our analysis with experiments performed on synthetic, semi-synthetic, and real data, and show how the proposed formulation improves prediction accuracy and robustness and outperforms competing techniques. Missing data is a common problem in many machine learning datasets. With the significant increase in the use of robust optimization techniques to train machine learning models, this paper presents a novel robust regression framework that operates by minimizing the uncertainty associated with missing data. The proposed approach allows training models with incomplete data while minimizing the impact of uncertainty associated with the unavailable data. The ideas developed in this paper can be generalized beyond linear models and elliptical data distributions.
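The building block mentioned above, the Gaussian conditional distribution of the missing feature entries given the observed ones, can be sketched as follows; the worst-case (min-max) convex program that the framework constructs on top of this conditional is not reproduced here, and all names below are illustrative.

```python
import numpy as np

def conditional_gaussian(x, observed, mu, Sigma):
    """Mean and covariance of x[missing] | x[observed] under N(mu, Sigma)."""
    o = np.asarray(observed)
    m = np.setdiff1d(np.arange(len(mu)), o)
    S_oo, S_mo = Sigma[np.ix_(o, o)], Sigma[np.ix_(m, o)]
    K = S_mo @ np.linalg.inv(S_oo)                 # regression of missing on observed
    mu_cond = mu[m] + K @ (x[o] - mu[o])
    Sigma_cond = Sigma[np.ix_(m, m)] - K @ S_mo.T  # Schur complement
    return m, mu_cond, Sigma_cond

rng = np.random.default_rng(1)
mu = np.zeros(4)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + np.eye(4)
x = rng.multivariate_normal(mu, Sigma)
missing, m_cond, S_cond = conditional_gaussian(x, observed=[0, 2], mu=mu, Sigma=Sigma)
print("missing indices:", missing, "conditional mean:", m_cond)
```

The conditional covariance `S_cond` is what quantifies the uncertainty about the missing entries that the robust formulation then guards against.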
In this paper, we develop a gradient recovery based linear (GRBL) finite element method (FEM) and a Hessian recovery based linear (HRBL) FEM for second order elliptic equations in non-divergence form. The elliptic equation is cast into a symmetric non-divergence weak formulation in which second order derivatives of the unknown function are involved. We use gradient and Hessian recovery operators to compute the second order derivatives of linear finite element approximations. Thanks to the low number of degrees of freedom (DOF) of linear elements, the implementation of the proposed schemes is easy and straightforward, yet the performance of the methods is competitive. The unique solvability and the $H^2$ seminorm error estimate of the GRBL scheme are rigorously proved. Optimal error estimates in both the $L^2$ norm and the $H^1$ seminorm are proved when the coefficient is diagonal, and these estimates are confirmed by numerical experiments. Superconvergence of the errors has also been observed. Moreover, our methods can handle computational domains with curved boundaries without the loss of accuracy that usually comes from approximating such boundaries. Finally, the proposed numerical methods have been successfully applied to solve fully nonlinear Monge-Amp\`{e}re equations.
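To make the setting concrete, the following is a brief sketch of the non-divergence model problem and of one standard way to build a recovered Hessian from linear elements; the specific recovery operators and the symmetric weak formulation used in the paper may differ in their details.
\[
  A(x) : D^2 u \;=\; \sum_{i,j=1}^{d} a_{ij}(x)\,\partial_{ij} u \;=\; f \quad \text{in } \Omega, \qquad u = g \quad \text{on } \partial\Omega .
\]
Since a linear finite element solution $u_h$ has only a piecewise constant gradient, the second order derivatives above are replaced by recovered ones: given a gradient recovery operator $G_h$ (applied componentwise), a recovered Hessian can be defined as
\[
  (H_h u_h)_{ij} \;=\; \bigl( G_h (G_h u_h)_j \bigr)_i , \qquad i,j = 1,\dots,d,
\]
and $H_h u_h$ is then substituted for $D^2 u$ in the discrete formulation.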
Upcoming astronomical surveys will observe billions of galaxies across cosmic time, providing a unique opportunity to map the many pathways of galaxy assembly at incredibly high resolution. However, the huge amount of data also poses an immediate computational challenge: current tools for inferring parameters from the light of galaxies take $\gtrsim 10$ hours per fit, which is prohibitively expensive. Simulation-based inference (SBI) is a promising solution. However, it requires simulated data with characteristics identical to those of the observed data, whereas real astronomical surveys are often highly heterogeneous, with missing observations and variable uncertainties determined by sky and telescope conditions. Here we present a Monte Carlo technique for treating out-of-distribution measurement errors and missing data using standard SBI tools. We show that out-of-distribution measurement errors can be approximated by using standard SBI evaluations, and that missing data can be marginalized over using SBI evaluations over nearby data realizations in the training set. While these techniques slow the inference process from $\sim 1$ sec to $\sim 1.5$ min per object, this is still significantly faster than standard approaches while also dramatically expanding the applicability of SBI. This expanded regime has broad implications for future applications to astronomical surveys.
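The sketch below illustrates the missing-data marginalization described above: the posterior for an object with a missing feature is approximated by averaging posterior evaluations over the values that feature takes in nearby training-set realizations. The callable `log_posterior` is a hypothetical stand-in for any trained SBI density estimator (not a specific library's API), and the neighbourhood size `k` is an assumed choice.

```python
import numpy as np

def posterior_with_missing(theta, x_obs, missing_idx, X_train, log_posterior, k=50):
    """Approximate log p(theta | x_obs) with x_obs[missing_idx] marginalized out."""
    obs_idx = [i for i in range(len(x_obs)) if i not in set(missing_idx)]
    # nearest training realizations in the observed dimensions
    d = np.linalg.norm(X_train[:, obs_idx] - x_obs[obs_idx], axis=1)
    neighbours = X_train[np.argsort(d)[:k]]
    # Monte Carlo marginalization: fill the missing entries from each neighbour
    vals = []
    for nb in neighbours:
        x_fill = x_obs.copy()
        x_fill[missing_idx] = nb[missing_idx]
        vals.append(log_posterior(theta, x_fill))
    # log-mean-exp over the k completions
    return np.logaddexp.reduce(np.array(vals)) - np.log(k)
```

Because each completion reuses an already-trained estimator, the extra cost is only the k additional evaluations, consistent with the quoted slowdown from seconds to minutes per object.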
Despite the progress in medical data collection, the actual burden of SARS-CoV-2 remains unknown due to severe under-ascertainment of cases. The use of reported deaths has been pointed out as a more reliable source of information, likely less prone to under-reporting. Given that daily deaths arise from past infections weighted by their probability of death, one may infer the true number of infections, accounting for their age distribution, from the data on reported deaths. We adopt this framework and assume that the dynamics generating the total number of infections can be described by a continuous-time transmission model expressed through a system of non-linear ordinary differential equations, where the transmission rate is modelled as a diffusion process, allowing us to reveal both the effect of control strategies and changes in individuals' behavior. We study the case of six European countries and estimate the time-varying reproduction number ($R_t$) as well as the true cumulative number of infected individuals using Stan. Because we estimate the true number of infections, we offer a more accurate estimate of $R_t$. We also provide an estimate of the daily reporting ratio and discuss the effects of changes in mobility and testing on the inferred quantities.
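For illustration, one typical way such a model is written down is sketched below; the exact compartmental structure, age stratification, observation model, and priors used in the analysis may differ.
\begin{align*}
  \frac{dS}{dt} &= -\beta(t)\,\frac{S I}{N}, &
  \frac{dE}{dt} &= \beta(t)\,\frac{S I}{N} - \alpha E, &
  \frac{dI}{dt} &= \alpha E - \gamma I, \\
  d\log\beta(t) &= \sigma\, dB_t, &
  R_t &= \frac{\beta(t)}{\gamma}\,\frac{S(t)}{N}, &
  \mathbb{E}[D_t] &= \mathrm{ifr}\sum_{s<t} i_s\,\pi_{t-s},
\end{align*}
where $i_s$ denotes new infections on day $s$, $\pi$ is the infection-to-death delay distribution, and $\mathrm{ifr}$ is the (age-weighted) infection fatality ratio; reported deaths are then modelled around $\mathbb{E}[D_t]$, for example with a negative binomial observation model, and the diffusion on $\log\beta(t)$ is what lets $R_t$ respond to control measures and behavioral change.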
In this paper, we propose a new approach for the time-discretization of the incompressible stochastic Stokes equations with multiplicative noise. Our new strategy is based on the classical Milstein method from stochastic differential equations. We use the energy method for its error analysis and show a strong convergence order of at most $1$ for both velocity and pressure approximations. The proof is based on a new H\"older continuity estimate of the velocity solution. While the errors of the velocity approximation are estimated in the standard $L^2$- and $H^1$-norms, the pressure errors are carefully analyzed in a special norm because of the low regularity of the pressure solution. In addition, a new interpretation of the pressure solution, which is very useful in computation, is also introduced. Numerical experiments are also provided to validate the error estimates and their sharpness.
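As background, recall the classical Milstein scheme for a scalar SDE $dX_t = a(X_t)\,dt + b(X_t)\,dW_t$, which the proposed time-stepping builds on; the actual scheme for the stochastic Stokes system, involving the Stokes operator and the pressure, is not reproduced here.
\[
  X_{n+1} \;=\; X_n + a(X_n)\,\Delta t + b(X_n)\,\Delta W_n
  \;+\; \tfrac{1}{2}\, b(X_n)\, b'(X_n)\,\bigl( (\Delta W_n)^2 - \Delta t \bigr),
  \qquad \Delta W_n = W_{t_{n+1}} - W_{t_n},
\]
whose extra Itô correction term is what raises the strong convergence order from $1/2$ (Euler-Maruyama) to $1$.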
Effect size measures and visualization techniques aimed at maximizing the interpretability and comparability of results from statistical models have long been important and have recently been receiving renewed attention in the literature. However, since the methods proposed in this context originate from a wide variety of disciplines and are more often than not practically motivated, they lack a common theoretical framework, and many quantities are narrowly or heuristically defined. In this work, we put forward a common mathematical setting for effect size measures and visualization techniques aimed at the results of parametric regression, and we define a formal framework for the consistent derivation of both existing and new variants of such quantities. Throughout the presented theory, we utilize probability measures to derive weighted means over areas of interest. While we take a Bayesian approach to quantifying uncertainty in order to derive consistent results for every defined quantity, all proposed methods apply to the results of both frequentist and Bayesian inference. We apply selected specifications derived from the proposed framework to data from a clinical trial and a multi-analyst study to illustrate its versatility and relevance.
This paper proposes a Sieve Simulated Method of Moments (Sieve-SMM) estimator for the parameters and the distribution of the shocks in nonlinear dynamic models where neither the likelihood nor the moments are tractable. An important concern with SMM, which matches sample moments with simulated moments, is that a parametric distribution for the shocks is required. However, economic quantities that depend on this distribution, such as welfare and asset prices, can be sensitive to misspecification. The Sieve-SMM estimator addresses this issue by flexibly approximating the distribution of the shocks with a Gaussian-and-tails mixture sieve. The asymptotic framework provides consistency, rate-of-convergence, and asymptotic normality results, extending existing results to a new framework with more general dynamics and latent variables. An application to asset pricing in a production economy shows a large decline in the estimates of relative risk aversion, highlighting the empirical relevance of misspecification bias.
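To make the matching step concrete, here is a minimal SMM sketch on a toy AR(1) model; it illustrates only the basic idea of matching sample moments with simulated moments under common random numbers, not the sieve approximation of the shock distribution that is the paper's key ingredient, and the toy model and moment choices are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def smm_objective(theta, data_moments, simulate, W, n_sim=20, seed=0):
    rng = np.random.default_rng(seed)               # common random numbers across theta
    sims = np.array([simulate(theta, rng) for _ in range(n_sim)])
    g = data_moments - sims.mean(axis=0)            # moment discrepancy
    return g @ W @ g

# toy model: AR(1) y_t = rho * y_{t-1} + e_t, matched on (mean, variance, lag-1 autocov)
def simulate(theta, rng, T=500):
    rho = theta[0]
    e = rng.normal(size=T)
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = rho * y[t - 1] + e[t]
    return np.array([y.mean(), y.var(), np.mean(y[1:] * y[:-1])])

data_moments = simulate(np.array([0.8]), np.random.default_rng(123))
W = np.eye(3)
res = minimize(smm_objective, x0=np.array([0.3]),
               args=(data_moments, simulate, W), method="Nelder-Mead")
print("estimated rho:", res.x)
```

In the sieve version, the shock distribution itself would be parameterized by the mixture sieve and estimated jointly with the structural parameters.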
Existing recommender systems extract user preferences by learning correlations in data, such as behavioral correlations in collaborative filtering, or feature-feature and feature-behavior correlations in click-through rate prediction. However, the real world is driven by causality rather than correlation, and correlation does not imply causation. For example, a recommender system may suggest a battery charger to a user who has just bought a phone: the purchase of the phone is the cause of the recommendation, and this causal relation cannot be reversed. Recently, to address this, researchers in recommender systems have begun to utilize causal inference to extract causality and thereby enhance recommendation. In this survey, we comprehensively review the literature on causal inference-based recommendation. We first present the fundamental concepts of both recommendation and causal inference as the basis for later content. We then describe the typical issues faced by non-causal recommendation. Afterwards, we comprehensively review existing work on causal inference-based recommendation, organized by a taxonomy of the problems that causal inference addresses. Finally, we discuss open problems in this important research area, along with interesting directions for future work.
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge numbers of model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge in huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in these parameters can benefit a variety of downstream tasks, as has been extensively demonstrated through experimental verification and empirical analysis. It is now the consensus of the AI community to adopt PTMs as the backbone for downstream tasks rather than learning models from scratch. In this paper, we take a deep look into the history of pre-training, especially its special relation to transfer learning and self-supervised learning, to reveal the crucial position of PTMs in the AI development spectrum. Further, we comprehensively review the latest breakthroughs in PTMs. These breakthroughs, driven by the surge of computational power and the increasing availability of data, fall along four important directions: designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. Finally, we discuss a series of open problems and research directions for PTMs, and hope our view can inspire and advance the future study of PTMs.