Estimating the expectation of a functional applied to a sum of random variables (RVs) is a well-known problem encountered in many challenging applications. Generally, closed-form expressions for these quantities are out of reach. Naive Monte Carlo simulation is an alternative approach; however, it requires a prohibitively large number of samples for rare-event problems. Therefore, it is paramount to use variance reduction techniques to develop fast and efficient estimation methods. In this work, we use importance sampling (IS), known for achieving a given accuracy requirement with fewer computations. We propose a state-dependent IS scheme based on a stochastic optimal control formulation, where the control depends on state and time. We aim to calculate rare-event quantities that can be written as the expectation of a functional of a sum of independent RVs. The proposed algorithm is generic and can be applied without restrictions on the univariate distributions of the RVs or the functional applied to the sum. We apply this approach to the log-normal distribution to compute the left tail, and to the cumulative distribution of the ratio of independent RVs. For each case, we numerically demonstrate that the proposed state-dependent IS algorithm compares favorably with the most well-known estimators for similar problems.
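To make the importance sampling idea concrete, the following minimal NumPy sketch contrasts crude Monte Carlo with a simple state-independent mean-shift IS estimator for the left tail of a sum of i.i.d. standard log-normal RVs; the shift parameter mu, the threshold gamma, and the sample sizes are illustrative assumptions, and the sketch does not implement the paper's state-dependent, optimal-control-based scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_mc(gamma, n_rv=10, n_samples=10**5):
    """Crude Monte Carlo estimate of P(sum of n_rv i.i.d. standard lognormals <= gamma)."""
    z = rng.standard_normal((n_samples, n_rv))
    s = np.exp(z).sum(axis=1)
    return (s <= gamma).mean()

def mean_shift_is(gamma, mu, n_rv=10, n_samples=10**5):
    """Importance sampling with a fixed negative mean shift mu in the Gaussian
    (log) space, which pushes samples toward the rare left-tail event.
    The likelihood ratio of N(0,1) against the proposal N(mu,1) is exp(-mu*z + mu**2/2)."""
    z = rng.standard_normal((n_samples, n_rv)) + mu       # draws from the shifted proposal
    s = np.exp(z).sum(axis=1)
    log_w = (-mu * z + 0.5 * mu**2).sum(axis=1)           # per-sample log likelihood ratio
    return np.mean((s <= gamma) * np.exp(log_w))

gamma = 0.5                                  # rare left-tail threshold for the sum of 10 lognormals
print("naive MC:", naive_mc(gamma))          # typically returns 0.0 at this sample size
print("shifted IS:", mean_shift_is(gamma, mu=-3.0))
```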
In a complex urban environment, due to the unavoidable interruption of GNSS positioning signals and the accumulation of errors during driving, collected vehicle trajectory data are likely to be inaccurate and incomplete. A weighted trajectory reconstruction algorithm based on a bidirectional RNN deep network is proposed. GNSS/OBD trajectory acquisition equipment is used to collect vehicle trajectory information, and multi-source data fusion is used to realize bidirectional weighted trajectory reconstruction. At the same time, a neural arithmetic logic unit (NALU) is introduced into the trajectory reconstruction model to strengthen the extrapolation ability of the deep network and ensure the accuracy of trajectory prediction, improving the robustness of the algorithm when reconstructing trajectories on complex urban road sections. Actual urban road sections were selected for the test experiments, and a comparative analysis was carried out against existing methods. Based on the root mean square error (RMSE) and Google Earth visualizations of the reconstructed trajectories, the experimental results demonstrate the effectiveness and reliability of the proposed algorithm.
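As a rough illustration of the architectural ingredient named above, here is a minimal PyTorch sketch of a NALU cell in the form proposed by Trask et al. (2018), attached to a bidirectional GRU encoder; the layer sizes, feature dimensions, and the way the head is wired are assumptions, not the paper's exact weighted reconstruction model.

```python
import torch
import torch.nn as nn

class NALU(nn.Module):
    """Neural Arithmetic Logic Unit: a gated mix of an additive neural accumulator
    and a multiplicative (log-space) path, intended to extrapolate arithmetic-like
    relations better than a plain MLP."""
    def __init__(self, in_dim, out_dim, eps=1e-7):
        super().__init__()
        self.W_hat = nn.Parameter(torch.empty(out_dim, in_dim))
        self.M_hat = nn.Parameter(torch.empty(out_dim, in_dim))
        self.G = nn.Parameter(torch.empty(out_dim, in_dim))
        for p in (self.W_hat, self.M_hat, self.G):
            nn.init.xavier_uniform_(p)
        self.eps = eps

    def forward(self, x):
        W = torch.tanh(self.W_hat) * torch.sigmoid(self.M_hat)     # constrained weights
        a = x @ W.t()                                               # additive path (NAC)
        m = torch.exp(torch.log(torch.abs(x) + self.eps) @ W.t())   # multiplicative path
        g = torch.sigmoid(x @ self.G.t())                           # learned gate
        return g * a + (1.0 - g) * m

class BiRNNWithNALU(nn.Module):
    """Bidirectional GRU encoder whose per-step features feed a NALU head;
    the exact weighting scheme of the paper is not reproduced here."""
    def __init__(self, feat_dim=4, hidden=64, out_dim=2):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = NALU(2 * hidden, out_dim)

    def forward(self, traj):          # traj: (batch, time, feat_dim)
        h, _ = self.rnn(traj)
        return self.head(h)           # per-step reconstructed coordinates
```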
A central goal in experimental high energy physics is to detect new physics signals that are not explained by known physics. In this paper, we aim to search for new signals that appear as deviations from known Standard Model physics in high-dimensional particle physics data. To do this, we determine whether there is any statistically significant difference between the distribution of Standard Model background samples and the distribution of the experimental observations, which are a mixture of the background and a potential new signal. Traditionally, one also assumes access to a sample from a model for the hypothesized signal distribution. Here we instead investigate a model-independent method that does not make any assumptions about the signal and uses a semi-supervised classifier to detect the presence of the signal in the experimental data. We construct three test statistics using the classifier: an estimated likelihood ratio test (LRT) statistic, a test based on the area under the ROC curve (AUC), and a test based on the misclassification error (MCE). Additionally, we propose a method for estimating the signal strength parameter and explore active subspace methods to interpret the proposed semi-supervised classifier in order to understand the properties of the detected signal. We also propose a Score test statistic that can be used in the model-dependent setting. We investigate the performance of the methods on a simulated data set related to the search for the Higgs boson at the Large Hadron Collider at CERN. We demonstrate that the semi-supervised tests have power competitive with the classical supervised methods for a well-specified signal, but much higher power for an unexpected signal which might be entirely missed by the supervised tests.
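A minimal scikit-learn sketch of the AUC-style test statistic follows, under the assumption that significance is assessed by a label permutation; the classifier choice (gradient boosting), the train/test split, and the permutation calibration are illustrative stand-ins, not the paper's exact procedure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_statistic(background, experimental, seed=0):
    """Train a classifier to separate the labeled background sample from the
    experimental sample and return the held-out AUC.  Under the background-only
    hypothesis the two samples are exchangeable and the AUC concentrates around
    1/2; a signal admixture pushes it above 1/2."""
    X = np.vstack([background, experimental])
    y = np.r_[np.zeros(len(background)), np.ones(len(experimental))]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    clf = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

def permutation_pvalue(background, experimental, n_perm=50, seed=0):
    """Crude permutation p-value: refit the statistic with sample labels shuffled."""
    rng = np.random.default_rng(seed)
    obs = auc_statistic(background, experimental, seed)
    pooled = np.vstack([background, experimental])
    n_b = len(background)
    null = []
    for i in range(n_perm):
        idx = rng.permutation(len(pooled))
        null.append(auc_statistic(pooled[idx[:n_b]], pooled[idx[n_b:]], seed=i))
    return obs, (1 + np.sum(np.array(null) >= obs)) / (n_perm + 1)
```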
The performance of decision policies and prediction models often deteriorates when they are applied to environments different from those seen during training. To ensure reliable operation, we propose and analyze a notion of the stability of a system under distribution shift, defined as the smallest change in the underlying environment that causes the system's performance to degrade beyond a permissible threshold. In contrast to standard tail risk measures and distributionally robust losses, which require specifying a plausible magnitude of distribution shift, the stability measure is defined in terms of a more intuitive quantity: the level of acceptable performance degradation. We develop a minimax optimal estimator of stability and analyze its convergence rate, which exhibits a fundamental phase-shift behavior. Our characterization of the minimax convergence rate shows that evaluating stability against large performance degradation incurs a statistical cost. Empirically, we demonstrate the practical utility of our stability framework by using it to compare system designs on problems where robustness to distribution shift is critical.
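The following NumPy sketch illustrates one plausible plug-in reading of such a stability measure, assuming distribution shift is measured by KL divergence from the observed loss distribution, in which case the worst-case reweighting for a given shift budget is an exponential tilt of the losses; the grid search, the simulated loss data, and the threshold are illustrative, and this is not the paper's minimax optimal estimator.

```python
import numpy as np

def plugin_stability(losses, threshold, t_grid=np.logspace(-3, 2, 400)):
    """Plug-in estimate of the smallest KL divergence (from the empirical loss
    distribution) of a reweighting that raises the mean loss to `threshold`.
    Worst-case reweightings for a KL budget are exponential tilts of the loss,
    so we scan the tilt parameter t and report the KL at the first crossing."""
    losses = np.asarray(losses, dtype=float)
    if losses.mean() >= threshold:
        return 0.0                      # already above the degradation threshold
    if losses.max() < threshold:
        return np.inf                   # no reweighting of observed losses suffices
    for t in t_grid:
        logw = t * losses
        logw -= logw.max()              # stabilize the exponentiation
        w = np.exp(logw)
        w /= w.sum()                    # tilted weights
        if np.dot(w, losses) >= threshold:
            # KL( tilted || empirical ) with empirical weights 1/n
            return np.dot(w, np.log(w * len(losses) + 1e-300))
    return np.inf

losses = np.random.default_rng(1).exponential(scale=1.0, size=5000)   # illustrative losses
print(plugin_stability(losses, threshold=2.0))
```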
Independent component analysis (ICA) is a blind source separation method that recovers source signals of interest from their mixtures. Most existing ICA procedures assume independent sampling. Source separation methods based on second-order statistics have been developed using parametric time series models for mixtures of autocorrelated sources. However, these methods cannot separate the sources accurately when the sources have temporal autocorrelation with mixed spectra. To address this issue, we propose a new ICA method that estimates the spectral density functions and line spectra of the source signals using cubic splines and indicator functions, respectively. The mixed spectra and the mixing matrix are estimated by maximizing the Whittle likelihood function. We illustrate the performance of the proposed method through simulation experiments and an EEG data application. The numerical results indicate that our approach outperforms existing ICA methods, including SOBI algorithms. In addition, we investigate the asymptotic behavior of the proposed method.
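As a small building block, here is a NumPy sketch of the Whittle log-likelihood for a single stationary series given a candidate spectral density, evaluated at the Fourier frequencies via the periodogram; the AR(1) example spectrum and the simulated data are illustrative, and the sketch omits the paper's spline and line-spectrum parameterization as well as the joint estimation of the mixing matrix.

```python
import numpy as np

def periodogram(x):
    """Periodogram at the positive Fourier frequencies 2*pi*k/n, k = 1..floor((n-1)/2)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dft = np.fft.rfft(x - x.mean())
    I = (np.abs(dft) ** 2) / (2 * np.pi * n)
    k = np.arange(1, (n - 1) // 2 + 1)
    return 2 * np.pi * k / n, I[k]

def whittle_loglik(x, spec_density):
    """Whittle approximation to the Gaussian log-likelihood:
    -sum_k [ log f(w_k) + I(w_k) / f(w_k) ] over the Fourier frequencies w_k."""
    freqs, I = periodogram(x)
    f = spec_density(freqs)
    return -np.sum(np.log(f) + I / f)

# Example: score an AR(1) spectral density f(w) = sigma^2 / (2*pi*|1 - phi*e^{-iw}|^2).
def ar1_spectrum(phi, sigma2):
    return lambda w: sigma2 / (2 * np.pi * np.abs(1 - phi * np.exp(-1j * w)) ** 2)

rng = np.random.default_rng(0)
x = np.zeros(2048)
for t in range(1, len(x)):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()
# The true AR(1) spectrum should score higher than a white-noise spectrum.
print(whittle_loglik(x, ar1_spectrum(0.7, 1.0)), whittle_loglik(x, ar1_spectrum(0.0, 1.0)))
```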
Nonlinear Markov chains (nMC) can be regarded as ordinary (linear) Markov chains with small nonlinear perturbations. They fit real-world data better, but their properties are difficult to describe. We propose a new approach to analyze the ergodicity of nMC and to estimate their convergence bounds more precisely than existing results. In the new method, a coupling Markov chain for homogeneous Markov chains is used to relate the distribution at any time to the limiting distribution, and the convergence bounds are obtained from the transition probability matrix of the coupling chain. Moreover, a new volatility measure, called TV Volatility, can be computed from the convergence bounds together with wavelet analysis and a Gaussian HMM. The method is tested by estimating the volatility of two securities (TSLA and AMC); the results show that TV Volatility closely tracks the magnitude of changes in squared returns over a period.
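For the classical (linear, homogeneous) case, the coupling-style bound below illustrates how a transition matrix yields an explicit convergence bound in total variation via Dobrushin's contraction coefficient; this NumPy sketch is a standard textbook construction and does not reproduce the paper's coupling for nonlinear chains or the TV Volatility computation.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a row-stochastic transition matrix P."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return pi / pi.sum()

def dobrushin(P):
    """Dobrushin contraction coefficient: the maximum TV distance between rows.
    A standard coupling argument gives TV(mu P^n, pi) <= dobrushin(P)^n * TV(mu, pi)."""
    n = P.shape[0]
    return max(0.5 * np.abs(P[i] - P[j]).sum() for i in range(n) for j in range(n))

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
pi = stationary(P)
mu = np.array([1.0, 0.0, 0.0])        # initial distribution concentrated on state 0
alpha = dobrushin(P)
for n in (1, 5, 10, 20):
    exact = 0.5 * np.abs(mu @ np.linalg.matrix_power(P, n) - pi).sum()
    bound = alpha ** n * 0.5 * np.abs(mu - pi).sum()
    print(f"n={n}: exact TV = {exact:.6f}, coupling bound = {bound:.6f}")
```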
Optimal balance is a non-asymptotic numerical method for computing a point on the slow manifold of certain two-scale dynamical systems. It works by solving a modified version of the system as a boundary value problem in time, in which the nonlinear terms are adiabatically ramped up from zero to the fully nonlinear dynamics. A dedicated boundary value solver, however, is often not directly available. The most natural alternative is a nudging solver, in which the problem is repeatedly solved forward and backward in time and the respective boundary conditions are restored whenever one of the temporal end points is visited. In this paper, we show quasi-convergence of this scheme in the sense that the termination residual of the nudging iteration is as small as the asymptotic error of the method itself, i.e., exponentially small under appropriate assumptions. This confirms that optimal balance in its nudging formulation is an effective algorithm. Further, it shows that the boundary value problem formulation of optimal balance is well posed up to a residual error as small as the asymptotic error of the method itself. The key step in our proof is a careful two-component Gronwall inequality.
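A schematic SciPy sketch of the nudging iteration for a toy fast-slow system follows, assuming boundary conditions of the common form "fast component removed at the initial time, slow component pinned to the prescribed value at the final time"; the toy right-hand side, ramp function, and tolerances are illustrative assumptions rather than the paper's setting.

```python
import numpy as np
from scipy.integrate import solve_ivp

eps = 0.05          # time-scale separation of the toy fast-slow system
T = 4.0             # ramp period of the boundary value problem
s_target = 1.0      # prescribed slow variable at t = T

def rho(tau):
    """Smooth ramp from 0 to 1 with vanishing derivatives at both ends."""
    tau = np.clip(tau, 0.0, 1.0)
    return tau * tau * (3.0 - 2.0 * tau)

def linear(u):
    s, f1, f2 = u
    return np.array([0.0, -f2 / eps, f1 / eps])              # fast rotation, slow frozen

def nonlinear(u):
    s, f1, f2 = u
    return np.array([f1 * f2 - 0.1 * s, s * f2, -s * f1])     # toy coupling terms

def ramped_rhs(t, u):
    return linear(u) + rho(t / T) * nonlinear(u)

# Nudging iteration: solve forward and backward repeatedly, restoring the
# boundary conditions whenever an end point is visited.
u0 = np.array([s_target, 0.0, 0.0])                            # linearly balanced guess
for k in range(10):
    fwd = solve_ivp(ramped_rhs, (0.0, T), u0, rtol=1e-9, atol=1e-11)
    uT = fwd.y[:, -1]
    uT[0] = s_target                                           # restore condition at t = T
    bwd = solve_ivp(ramped_rhs, (T, 0.0), uT, rtol=1e-9, atol=1e-11)
    u0 = bwd.y[:, -1]
    residual = np.linalg.norm(u0[1:])                          # leftover fast part at t = 0
    u0[1:] = 0.0                                               # restore condition at t = 0
    print(f"iteration {k}: termination residual = {residual:.3e}")

balanced = solve_ivp(ramped_rhs, (0.0, T), u0, rtol=1e-9, atol=1e-11).y[:, -1]
print("balanced state at t = T:", balanced)
```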
The field of quantum Hamiltonian complexity lies at the intersection of quantum many-body physics and computational complexity theory, with deep implications for both fields. The main object of study is the LocalHamiltonian problem, which is concerned with estimating the ground-state energy of a local Hamiltonian and is complete for the class QMA, a quantum generalization of the class NP. A major challenge in the field is to understand the complexity of the LocalHamiltonian problem in more physically natural parameter regimes. One crucial parameter for understanding the ground space of any Hamiltonian in many-body physics is the spectral gap, the difference between the two smallest eigenvalues. Despite its importance in quantum many-body physics, the role played by the spectral gap in the complexity of LocalHamiltonian is less well understood. In this work, we make progress on this question by considering the precise regime, in which one estimates the ground-state energy to within inverse-exponential precision. Computing ground-state energies precisely is a task important for quantum chemistry and quantum many-body physics. In the setting of inverse-exponential precision, there is a surprising result that the complexity of LocalHamiltonian is magnified from QMA to PSPACE, the class of problems solvable in polynomial space. We clarify the reason behind this boost in complexity. Specifically, we show that the full complexity of the high-precision case arises only when the spectral gap is exponentially small. As a consequence of the proof techniques developed to show our results, we uncover important implications for the representability and circuit complexity of ground states of local Hamiltonians, the theory of uniqueness of quantum witnesses, and techniques for the amplification of quantum witnesses in the presence of postselection.
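To fix ideas about the two spectral quantities discussed here, the NumPy sketch below exactly diagonalizes a small 2-local Hamiltonian (an open-chain transverse-field Ising model, chosen purely for illustration) and reports its ground-state energy and spectral gap.

```python
import numpy as np

# Pauli matrices and the 2x2 identity
I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=float)
Z = np.array([[1, 0], [0, -1]], dtype=float)

def kron_chain(ops):
    """Tensor product of a list of single-qubit operators."""
    out = ops[0]
    for op in ops[1:]:
        out = np.kron(out, op)
    return out

def tfim_hamiltonian(n, J=1.0, h=0.5):
    """Open-chain transverse-field Ising model, a 2-local Hamiltonian:
       H = -J * sum_i Z_i Z_{i+1} - h * sum_i X_i."""
    dim = 2 ** n
    H = np.zeros((dim, dim))
    for i in range(n - 1):
        ops = [I2] * n; ops[i] = Z; ops[i + 1] = Z
        H -= J * kron_chain(ops)
    for i in range(n):
        ops = [I2] * n; ops[i] = X
        H -= h * kron_chain(ops)
    return H

H = tfim_hamiltonian(n=8)
evals = np.linalg.eigvalsh(H)
ground_energy, gap = evals[0], evals[1] - evals[0]
print(f"ground-state energy = {ground_energy:.6f}, spectral gap = {gap:.6f}")
```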
We propose a monitoring strategy for efficient and robust estimation of disease prevalence and case numbers within closed and enumerated populations such as schools, workplaces, or retirement communities. The proposed design relies largely on voluntary testing, which is notoriously biased (e.g., in the case of COVID-19) due to non-representative sampling. The approach yields unbiased and comparatively precise estimates with no assumptions about the factors underlying selection of individuals for voluntary testing, building on the strength of what can be a small random sampling component. This component unlocks a previously proposed "anchor stream" estimator, a well-calibrated alternative to classical capture-recapture (CRC) estimators based on two data streams. We show here that this estimator is equivalent to a direct standardization based on "capture", i.e., selection (or not) by the voluntary testing program, made possible by means of a key parameter identified by design. This equivalence simultaneously allows for novel two-stream CRC-like estimation of general means (e.g., of continuous variables such as antibody levels or biomarkers). For inference, we propose an adaptation of a Bayesian credible interval when estimating case counts and bootstrapping when estimating means of continuous variables. We use simulations to demonstrate significant precision benefits relative to random sampling alone.
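The following NumPy simulation sketches one way to read the "direct standardization based on capture" idea: prevalence is decomposed over volunteering status, with the prevalence among non-volunteers estimated from the small random anchor sample; the population size, prevalence, and selection probabilities are invented for illustration, and this is not necessarily the exact anchor stream estimator or its inference procedure.

```python
import numpy as np

rng = np.random.default_rng(7)

# --- Simulate a closed, enumerated population (illustrative values only) ---
N = 10_000
infected = rng.random(N) < 0.03                      # true prevalence 3%
# Volunteering is biased: infected (e.g., symptomatic) people test more often.
p_vol = np.where(infected, 0.60, 0.15)
volunteered = rng.random(N) < p_vol
anchor = rng.random(N) < 0.02                        # small random "anchor" sample

# --- Naive estimate from voluntary testing alone (biased upward) ---
naive = infected[volunteered].mean()

# --- Direct standardization over capture by the voluntary stream ---
# Prevalence among volunteers is observed directly; prevalence among
# non-volunteers is estimated from anchor-sample members who did not volunteer.
n_vol = volunteered.sum()
prev_vol = infected[volunteered].mean()
prev_nonvol = infected[anchor & ~volunteered].mean()
standardized = (n_vol * prev_vol + (N - n_vol) * prev_nonvol) / N

print(f"true {infected.mean():.4f}  naive {naive:.4f}  standardized {standardized:.4f}")
```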
Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training's underlying interactions by quantifying the amount each training instance alters the final model. Measuring the training data's influence exactly can be provably hard in the worst case; this has led to the development and use of influence estimators, which only approximate the true influence. This paper provides the first comprehensive survey of training data influence analysis and estimation. We begin by formalizing the various, and in places orthogonal, definitions of training data influence. We then organize state-of-the-art influence analysis methods into a taxonomy; we describe each of these methods in detail and compare their underlying assumptions, asymptotic complexities, and overall strengths and weaknesses. Finally, we propose future research directions to make influence analysis more useful in practice as well as more theoretically and empirically sound. A curated, up-to-date list of resources related to influence analysis is available at https://github.com/ZaydH/influence_analysis_papers.
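As a baseline for what influence estimators approximate, here is a brute-force scikit-learn sketch of exact leave-one-out influence on a held-out loss for a small logistic regression; the dataset, model, and loss are arbitrary illustrative choices rather than any particular estimator covered by the survey.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

def test_loss(train_idx):
    """Fit on the given training subset and return the mean test log-loss."""
    clf = LogisticRegression(max_iter=1000).fit(X_train[train_idx], y_train[train_idx])
    return log_loss(y_test, clf.predict_proba(X_test), labels=[0, 1])

full_idx = np.arange(len(X_train))
base = test_loss(full_idx)

# Exact leave-one-out influence of each training point on the test loss:
# positive values mean removing the point increases the test loss (a helpful point).
influence = np.array([test_loss(np.delete(full_idx, i)) - base
                      for i in range(len(X_train))])
print("most helpful:", influence.argsort()[-3:], "most harmful:", influence.argsort()[:3])
```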
We introduce a novel approach to inference on parameters that take values in a Riemannian manifold embedded in a Euclidean space. Parameter spaces of this form are ubiquitous across many fields, including chemistry, physics, computer graphics, and geology. This new approach uses generalized fiducial inference to obtain a posterior-like distribution on the manifold, without needing to know a parameterization that maps the constrained space to an unconstrained Euclidean space. The proposed methodology, called the constrained generalized fiducial distribution (CGFD), is obtained using mathematical tools from Riemannian geometry. We provide a Bernstein-von Mises-type result for the CGFD, which gives intuition for how the desirable asymptotic qualities of the unconstrained generalized fiducial distribution are inherited by the CGFD. To demonstrate the practical use of the CGFD, we present three proof-of-concept examples: inference for data from a multivariate normal density with the mean parameters on a sphere, a linear logspline density estimation problem, and a reimagined approach to the AR(1) model, all of which exhibit desirable coverage in simulations. We also discuss two Markov chain Monte Carlo algorithms for exploring these constrained parameter spaces and adapt them to the CGFD.
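As a hint of what sampling on such a parameter space can look like, here is a NumPy sketch of a geodesic random-walk Metropolis sampler on the unit sphere targeting a von Mises-Fisher-like density; the target, step size, and chain length are illustrative assumptions, and this is not the CGFD construction or either of the MCMC algorithms discussed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def exp_map(x, v):
    """Exponential map on the unit sphere: move from x along tangent vector v."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return x
    return np.cos(norm) * x + np.sin(norm) * (v / norm)

def tangent_noise(x, scale, dim):
    """Isotropic Gaussian noise projected onto the tangent space at x."""
    g = scale * rng.standard_normal(dim)
    return g - np.dot(g, x) * x

def sphere_mh(log_density, dim=3, n_samples=20_000, scale=0.3):
    """Geodesic random-walk Metropolis on the sphere.  Isotropic tangent proposals
    pushed through the exponential map depend only on geodesic distance, so the
    proposal is symmetric and the usual Metropolis acceptance ratio applies."""
    x = np.zeros(dim); x[0] = 1.0
    samples = []
    for _ in range(n_samples):
        y = exp_map(x, tangent_noise(x, scale, dim))
        if np.log(rng.random()) < log_density(y) - log_density(x):
            x = y
        samples.append(x)
    return np.array(samples)

# Example target: log-density proportional to kappa * <mu, x> (von Mises-Fisher-like).
mu = np.array([0.0, 0.0, 1.0])
kappa = 10.0
chain = sphere_mh(lambda x: kappa * np.dot(mu, x))
mean_dir = chain.mean(axis=0)
print("sample mean direction:", mean_dir / np.linalg.norm(mean_dir))
```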