Some families of count distributions do not have a closed-form probability mass function and/or finite moments, and therefore parameter estimation cannot be performed with classical methods. When the probability generating function of the distribution is available, a new approach based on censoring and a moment criterion is introduced, in which the original distribution is replaced by a version censored using a Geometric distribution. Consistency and asymptotic normality of the resulting estimators are proven under suitable conditions. The crucial issue of selecting the censoring parameter is addressed by means of a data-driven procedure. Finally, this novel approach is applied to the discrete stable family, and the finite-sample performance of the estimators is assessed by means of a Monte Carlo simulation study.
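As a hedged illustration of why the availability of the probability generating function helps (this is not the censored-moment procedure described above), the following minimal sketch matches the empirical PGF of a discrete stable sample against the standard parametrization $G(s)=\exp\{-\lambda(1-s)^{\gamma}\}$; since $\log(-\log G(s)) = \log\lambda + \gamma\log(1-s)$, two evaluation points give a simple linear system in $(\log\lambda,\gamma)$. The evaluation points $s_1,s_2$ are arbitrary choices.

```python
import numpy as np

def empirical_pgf(x, s):
    """Monte Carlo estimate of the PGF E[s^X] from an i.i.d. count sample x."""
    return np.mean(s ** np.asarray(x))

def fit_discrete_stable_pgf(x, s1=0.3, s2=0.7):
    """Match the empirical PGF at two points to exp(-lam * (1 - s)**gam).

    From log(-log G(s)) = log(lam) + gam * log(1 - s), two evaluation
    points give a 2x2 linear system in (log(lam), gam).
    """
    y1 = np.log(-np.log(empirical_pgf(x, s1)))
    y2 = np.log(-np.log(empirical_pgf(x, s2)))
    gam = (y1 - y2) / (np.log(1 - s1) - np.log(1 - s2))
    log_lam = y1 - gam * np.log(1 - s1)
    return np.exp(log_lam), gam
```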
In this work, we study the problem of robustly estimating the mean/location parameter of distributions without moment bounds. For a large class of distributions satisfying natural symmetry constraints, we give a sequence of algorithms that can efficiently estimate the location without incurring dimension-dependent factors in the error. Concretely, suppose an adversary can arbitrarily corrupt an $\varepsilon$-fraction of the observed samples. For every $k \in \mathbb{N}$, we design an estimator using time and samples $\tilde{O}({d^k})$ such that the dependence of the error on the corruption level $\varepsilon$ is an additive factor of $O(\varepsilon^{1-\frac{1}{2k}})$. The dependence on other problem parameters is also nearly optimal. Our class contains products of arbitrary symmetric one-dimensional distributions as well as elliptical distributions, a vast generalization of the Gaussian distribution. Examples include product Cauchy distributions and multivariate $t$-distributions. In particular, even the first moment might not exist. We provide the first efficient algorithms for this class of distributions. Previously, such results were only known under boundedness assumptions on the moments of the distribution and, in particular, are provably impossible in the absence of symmetry [KSS18, CTBJ22]. For the class of distributions we consider, all previous estimators either require exponential time or incur error depending on the dimension. Our algorithms are based on a generalization of the filtering technique [DK22]. We show how this machinery can be combined with a Huber-loss-based approach to work with projections of the noise. Moreover, we show how sum-of-squares proofs can be used to obtain algorithmic guarantees even for distributions without a first moment. We believe that this approach may find other applications in future work.
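For intuition about the Huber-loss ingredient mentioned above, here is a minimal, hedged sketch (not the dimension-free filtering algorithm of the abstract): a one-dimensional Huber-loss location M-estimate applied coordinate-wise, which already handles heavy tails such as the Cauchy, although it incurs dimension-dependent error under corruption. The threshold delta is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_loss(r, delta):
    """Quadratic for small residuals, linear for large ones."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def huber_location(x, delta=1.0):
    """1-D location M-estimate minimizing the summed Huber loss."""
    res = minimize_scalar(lambda mu: huber_loss(x - mu, delta).sum(),
                          bounds=(x.min(), x.max()), method="bounded")
    return res.x

def coordinatewise_huber(X, delta=1.0):
    """Apply the 1-D estimator to each coordinate of an (n, d) sample."""
    return np.array([huber_location(X[:, j], delta) for j in range(X.shape[1])])
```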
Density Ratio Estimation (DRE) is an important machine learning technique with many downstream applications. We consider the challenge of DRE with missing not at random (MNAR) data. In this setting, we show that using standard DRE methods leads to biased results, while our proposal (M-KLIEP), an adaptation of the popular DRE procedure KLIEP, restores consistency. Moreover, we provide finite-sample estimation error bounds for M-KLIEP, which demonstrate minimax optimality with respect to both sample size and worst-case missingness. We then adapt an important downstream application of DRE, Neyman-Pearson (NP) classification, to this MNAR setting. Our procedure both controls Type I error and achieves high power, with high probability. Finally, we demonstrate promising empirical performance on both synthetic data and real-world data with simulated missingness.
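Since the proposal above adapts KLIEP, a minimal sketch of plain KLIEP (fully observed data, no missingness correction) may help: the density ratio is modelled as a non-negative combination of Gaussian kernels centred at numerator samples, the numerator log-likelihood is maximized by projected gradient ascent, and the denominator average of the fitted ratio is constrained to one. All tuning values (kernel width, number of centres, step size) are arbitrary; this is not the paper's M-KLIEP.

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    """Gaussian kernel matrix between rows of X (n, d) and centres C (L, d)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kliep(x_nu, x_de, sigma=1.0, n_centers=100, n_iter=500, lr=1e-3, seed=0):
    """Plain KLIEP: fit r(x) = sum_l alpha_l K(x, c_l) by maximizing the
    numerator log-likelihood subject to a denominator normalization."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_nu), min(n_centers, len(x_nu)), replace=False)
    centers = x_nu[idx]
    K_nu = gaussian_kernel(x_nu, centers, sigma)   # (n_nu, L)
    K_de = gaussian_kernel(x_de, centers, sigma)   # (n_de, L)
    b = K_de.mean(axis=0)                          # normalization vector
    alpha = np.ones(K_nu.shape[1])
    alpha /= b @ alpha
    for _ in range(n_iter):
        grad = (K_nu / (K_nu @ alpha)[:, None]).mean(axis=0)  # grad of mean log r
        alpha = np.maximum(alpha + lr * grad, 0)   # ascent step + non-negativity
        alpha /= b @ alpha                         # enforce mean_de r(x) = 1
    return lambda x: gaussian_kernel(x, centers, sigma) @ alpha
```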
A popular way to estimate the parameters of a hidden Markov model (HMM) is direct numerical maximization (DNM) of the (log-)likelihood function. The advantages of employing the TMB (Kristensen et al., 2016) framework in R for this purpose were illustrated recently by Bacri et al. (2022). In this paper, we present extensions of these results in two directions. First, we present a practical way to obtain uncertainty estimates in the form of confidence intervals (CIs) for the so-called smoothing probabilities at moderate computational and programming effort via TMB. Our approach thus makes it possible to avoid computer-intensive bootstrap methods. By means of several examples, we illustrate patterns present in the derived CIs. Second, we investigate the performance of popular optimizers available in R when estimating HMMs via DNM, with a focus on the potential benefits of employing TMB. The criteria investigated, via a number of simulation studies, are convergence speed, accuracy, and the impact of (poor) initial values. Our findings suggest that all optimizers considered benefit in terms of speed from using the gradient supplied by TMB. When supplying both gradient and Hessian from TMB, the number of iterations reduces, suggesting a more efficient convergence to the maximum of the log-likelihood. Lastly, we briefly point out potential advantages of a hybrid approach.
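To make the DNM idea concrete, here is a minimal sketch (the paper works in R with TMB; this Python analogue is only illustrative): the log-likelihood of a two-state Gaussian HMM is computed with the log-space forward algorithm over unconstrained parameters and handed to a quasi-Newton optimizer. The gradient here is obtained numerically rather than from TMB, and the uniform initial state distribution is a simplifying assumption.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import norm

def unpack(theta, m):
    """Map unconstrained parameters to (Gamma, mu, sigma) for an m-state
    Gaussian HMM: row-wise softmax for the t.p.m., exp for the std. devs."""
    tpm = theta[: m * m].reshape(m, m)
    Gamma = np.exp(tpm - logsumexp(tpm, axis=1, keepdims=True))
    mu = theta[m * m: m * m + m]
    sigma = np.exp(theta[m * m + m:])
    return Gamma, mu, sigma

def neg_log_lik(theta, y, m):
    """Negative log-likelihood via the log-space forward algorithm,
    starting from the uniform initial distribution for simplicity."""
    Gamma, mu, sigma = unpack(theta, m)
    log_Gamma = np.log(Gamma)
    log_p = norm.logpdf(y[:, None], mu[None, :], sigma[None, :])   # (T, m)
    log_alpha = -np.log(m) + log_p[0]
    for t in range(1, len(y)):
        log_alpha = logsumexp(log_alpha[:, None] + log_Gamma, axis=0) + log_p[t]
    return -logsumexp(log_alpha)

# DNM with a quasi-Newton optimizer and numerical gradients (m = 2 states):
# theta0 = np.concatenate([np.zeros(4), np.quantile(y, [0.25, 0.75]), np.zeros(2)])
# fit = minimize(neg_log_lik, theta0, args=(y, 2), method="BFGS")
```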
Theoretically, the conditional expectation of a square-integrable random variable $Y$ given a $d$-dimensional random vector $X$ can be obtained by minimizing the mean squared distance between $Y$ and $f(X)$ over all Borel measurable functions $f \colon \mathbb{R}^d \to \mathbb{R}$. However, in many applications this minimization problem cannot be solved exactly, and instead, a numerical method which computes an approximate minimum over a suitable subfamily of Borel functions has to be used. The quality of the result depends on the adequacy of the subfamily and the performance of the numerical method. In this paper, we derive an expected value representation of the minimal mean squared distance which in many applications can efficiently be approximated with a standard Monte Carlo average. This enables us to provide guarantees for the accuracy of any numerical approximation of a given conditional expectation. We illustrate the method by assessing the quality of approximate conditional expectations obtained by linear, polynomial and neural network regression in different concrete examples.
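The kind of accuracy check described above can be illustrated on a toy example in which the conditional expectation is known in closed form (so the minimal mean squared distance is available directly, rather than through the paper's expected-value representation): the gap between the Monte Carlo estimate of $E[(Y-f(X))^2]$ and the minimal distance equals the squared $L^2$ error of the approximation $f$. The data-generating model and the linear-regression candidate below are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, d, noise_sd = 100_000, 5, 0.5

# Toy model in which E[Y | X] = sin(X_1) is known exactly.
X = rng.normal(size=(n, d))
Y = np.sin(X[:, 0]) + noise_sd * rng.normal(size=n)

f = LinearRegression().fit(X[: n // 2], Y[: n // 2])       # candidate approximation
Xte, Yte = X[n // 2:], Y[n // 2:]

mse_f = np.mean((Yte - f.predict(Xte)) ** 2)                # Monte Carlo E[(Y - f(X))^2]
mse_min = noise_sd ** 2                                     # E[(Y - E[Y|X])^2], known here
print("estimated squared L2 error of f:", mse_f - mse_min)
print("exact squared L2 error of f:    ",
      np.mean((f.predict(Xte) - np.sin(Xte[:, 0])) ** 2))
```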
Physics-informed deep learning has recently emerged as an effective tool for leveraging both observational data and available physical laws. Physics-informed neural networks (PINNs) and deep operator networks (DeepONets) are two such models. The former encodes the physical laws via automatic differentiation, while the latter learns the hidden physics from data. Generally, the noisy and limited observational data as well as the overparameterization of neural networks (NNs) result in uncertainty in the predictions of deep learning models. In [1], a Bayesian framework based on generative adversarial networks (GANs) has been proposed as a unified model to quantify uncertainties in the predictions of PINNs as well as DeepONets. Specifically, the approach proposed in [1] has two stages: (1) prior learning, and (2) posterior estimation. At the first stage, GANs are employed to learn a functional prior either from a prescribed function distribution, e.g., a Gaussian process, or from historical data and available physics. At the second stage, the Hamiltonian Monte Carlo (HMC) method is utilized to estimate the posterior in the latent space of the GANs. However, vanilla HMC does not support mini-batch training, which limits its applicability in problems with big data. In the present work, we propose to use normalizing flow (NF) models in the context of variational inference, which naturally enables mini-batch training, as an alternative to HMC for posterior estimation in the latent space of GANs. A series of numerical experiments, including a nonlinear differential equation problem and a 100-dimensional Darcy problem, are conducted to demonstrate that NF with full-/mini-batch training is able to achieve accuracy similar to that of the ``gold standard'' HMC.
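As a rough, hedged sketch of the flow-based variational step (a generic planar-flow construction, not necessarily the NF architecture used in this work), the snippet below fits a normalizing flow to an arbitrary unnormalized log-posterior log_post by minimizing the reparameterized Monte Carlo KL divergence. In the GAN-latent-space setting, log_post would evaluate a (possibly mini-batched) data likelihood pushed through the generator plus the latent prior; that interface is an assumption, not taken from [1], and the usual reparameterization of u that guarantees invertibility is omitted for brevity.

```python
import math
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """One planar transform f(z) = z + u * tanh(w^T z + b)."""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(dim))
        self.w = nn.Parameter(0.01 * torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                              # z: (batch, dim)
        lin = z @ self.w + self.b                      # (batch,)
        f = z + self.u * torch.tanh(lin).unsqueeze(-1)
        psi = (1.0 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1.0 + psi @ self.u) + 1e-8)
        return f, log_det

def fit_flow_posterior(log_post, dim, n_flows=8, n_steps=2000, batch=128):
    """Variational inference with a planar flow: minimize the Monte Carlo
    KL(q_K || posterior) using reparameterized samples from a N(0, I) base."""
    flows = nn.ModuleList(PlanarFlow(dim) for _ in range(n_flows))
    opt = torch.optim.Adam(flows.parameters(), lr=1e-2)
    for _ in range(n_steps):
        z = torch.randn(batch, dim)
        log_q = -0.5 * (z ** 2).sum(-1) - 0.5 * dim * math.log(2 * math.pi)
        for flow in flows:
            z, log_det = flow(z)
            log_q = log_q - log_det                    # change of variables
        loss = (log_q - log_post(z)).mean()            # KL up to a constant
        opt.zero_grad(); loss.backward(); opt.step()
    return flows
```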
Sampling from matrix generalized inverse Gaussian (MGIG) distributions is required in Markov Chain Monte Carlo (MCMC) algorithms for a variety of statistical models. However, an efficient sampling scheme for the MGIG distributions has not been fully developed. Here, we propose a novel blocked Gibbs sampler for the MGIG distributions, based on the Cholesky decomposition. We show that the full conditionals of the diagonal and unit lower-triangular entries are univariate generalized inverse Gaussian and multivariate normal distributions, respectively. Several variants of the Metropolis-Hastings algorithm can also be considered for this problem, but we mathematically prove that their average acceptance rates become extremely low in particular scenarios. We demonstrate the computational efficiency of the proposed Gibbs sampler through simulation studies and data analysis.
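Both kinds of full-conditional draws named above are standard; as a hedged illustration of the univariate building block only, the three-parameter GIG density proportional to $x^{p-1}\exp\{-(ax + b/x)/2\}$ can be sampled with SciPy's two-parameter geninvgauss after a rescaling (the multivariate normal draws are routine). The exact parameters of the paper's full conditionals are not reproduced here.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(p, a, b, rng=None):
    """Draw from GIG(p, a, b), density proportional to
    x**(p - 1) * exp(-(a*x + b/x) / 2) on x > 0, via scipy's geninvgauss,
    whose density is proportional to x**(p - 1) * exp(-c*(x + 1/x) / 2)."""
    scale = np.sqrt(b / a)                 # if Y ~ geninvgauss(p, c), X = scale * Y
    c = np.sqrt(a * b)
    return scale * geninvgauss.rvs(p, c, random_state=rng)

# A blocked Gibbs sweep would alternate such univariate GIG draws for the
# diagonal entries with multivariate normal draws, e.g.
# rng.multivariate_normal(mean, cov), for the lower-triangular blocks.
```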
Understanding how different classes are distributed in an unlabeled data set is an important challenge for the calibration of probabilistic classifiers and uncertainty quantification. Approaches like adjusted classify and count, black-box shift estimators, and invariant ratio estimators use an auxiliary (and potentially biased) black-box classifier trained on a different (shifted) data set to estimate the class distribution and yield asymptotic guarantees under weak assumptions. We demonstrate that all these algorithms are closely related to inference in a particular Bayesian model, approximating the assumed ground-truth generative process. Then, we discuss an efficient Markov Chain Monte Carlo sampling scheme for the introduced model and show an asymptotic consistency guarantee in the large-data limit. We compare the introduced model against the established point estimators in a variety of scenarios and show that it is competitive with, and in some cases superior to, the state of the art.
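For reference, one of the classical point estimators mentioned above, adjusted classify and count (closely related to black-box shift estimation), admits a very short implementation: estimate how the black-box classifier confuses classes on labelled validation data, then invert that relationship against the distribution of predictions on the unlabelled target data. This is the kind of baseline the Bayesian model is compared against, not the proposed MCMC scheme.

```python
import numpy as np

def adjusted_classify_and_count(y_val, yhat_val, yhat_target, n_classes):
    """ACC/BBSE-style prevalence estimate from a (possibly biased) classifier.

    Solves  sum_i q_i * P(pred = j | true = i) = P_target(pred = j)  for q."""
    C = np.zeros((n_classes, n_classes))           # C[i, j] = P(pred j | true i)
    for i in range(n_classes):
        mask = (y_val == i)
        C[i] = np.bincount(yhat_val[mask], minlength=n_classes) / mask.sum()
    p_obs = np.bincount(yhat_target, minlength=n_classes) / len(yhat_target)
    q, *_ = np.linalg.lstsq(C.T, p_obs, rcond=None)
    q = np.clip(q, 0, None)
    return q / q.sum()                             # project back onto the simplex
```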
Semiparametric inference on average causal effects from observational data is based on assumptions yielding identification of the effects. In practice, several distinct identifying assumptions may be plausible, and an analyst has to make a delicate choice between these models. In this paper, we study three identifying assumptions based on the potential outcome framework: the back-door assumption, which uses pre-treatment covariates; the front-door assumption, which uses mediators; and the two-door assumption, which uses pre-treatment covariates and mediators simultaneously. We provide the efficient influence functions and the corresponding semiparametric efficiency bounds that hold under these assumptions and their combinations. We demonstrate that none of the identification models uniformly provides the most efficient estimation, and we give conditions under which some bounds are lower than others. We show when semiparametric estimating equation estimators based on influence functions attain the bounds, and study the robustness of the estimators to misspecification of the nuisance models. The theory is complemented with simulation experiments on the finite sample behavior of the estimators. The results obtained are relevant for an analyst facing a choice between several plausible identifying assumptions and corresponding estimators. Our results show that this choice implies a trade-off between efficiency and robustness to misspecification of the nuisance models.
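As a concrete, hedged example of an influence-function-based estimating-equation estimator, the sketch below implements the familiar augmented inverse-probability-weighting (AIPW) estimator of the average effect under the back-door assumption with simple parametric nuisance models; the front-door and two-door counterparts studied in the paper are not shown, and the choice of nuisance models here is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_backdoor_ate(X, T, Y):
    """AIPW (efficient influence function) estimator of the average causal
    effect under the back-door assumption, with illustrative nuisance models.

    X: (n, p) pre-treatment covariates, T: (n,) binary treatment, Y: (n,) outcome."""
    e = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    m1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
    m0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)
    psi = (m1 - m0
           + T * (Y - m1) / e
           - (1 - T) * (Y - m0) / (1 - e))          # influence-function terms
    ate = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(Y))          # plug-in standard error
    return ate, se
```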
The problems of selecting partial correlation and causality graphs for count data are considered. A parameter-driven generalized linear model is used to describe the observed multivariate time series of counts. Partial correlation and causality graphs corresponding to this model explain the dependencies between the individual time series of the multivariate count data. In order to estimate these graphs with tunable sparsity, an appropriate likelihood function maximization is regularized with an l1-type constraint. A novel MCEM algorithm is proposed to iteratively solve this regularized MLE. Asymptotic convergence results are proved for the sequence generated by the proposed MCEM algorithm with l1-type regularization. The algorithm is first successfully tested on simulated data. Thereafter, it is applied to observed weekly dengue disease counts from each ward of Greater Mumbai city. The interdependence of various wards in the proliferation of the disease is characterized by the edges of the inferred partial correlation graph. On the other hand, the relative roles of various wards as sources and sinks of dengue spread are quantified by the number and weights of the directed edges originating from and incident upon each ward. From these estimated graphs, it is observed that certain wards act as epicentres of dengue spread even though their disease counts are relatively low.
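For orientation, a much simpler, hedged alternative to the MCEM procedure above is the Gaussian graphical lasso, which likewise selects a partial-correlation graph with tunable l1 sparsity; applied to (log-transformed) count series it ignores the parameter-driven GLM structure, so it is only a baseline sketch, and the transformation and penalty level are arbitrary choices.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def partial_correlation_graph(counts, alpha=0.05):
    """l1-regularized partial-correlation graph for an (n_weeks, n_wards)
    count matrix: nonzero off-diagonal precision entries define edges."""
    Z = np.log1p(counts)                           # crude variance stabilization
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    prec = GraphicalLasso(alpha=alpha).fit(Z).precision_
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)                 # partial correlations
    np.fill_diagonal(pcorr, 1.0)
    edges = np.abs(pcorr) > 1e-8
    np.fill_diagonal(edges, False)
    return pcorr, edges
```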
Classic machine learning methods are built on the $i.i.d.$ assumption that training and testing data are independent and identically distributed. However, in real scenarios, the $i.i.d.$ assumption can hardly be satisfied, leading to sharp performance drops for classic machine learning algorithms under distributional shifts and underscoring the significance of investigating the Out-of-Distribution (OOD) generalization problem. The OOD generalization problem addresses the challenging setting where the testing distribution is unknown and different from the training distribution. This paper serves as the first effort to systematically and comprehensively discuss the OOD generalization problem, from definition, methodology, and evaluation to implications and future directions. Firstly, we provide the formal definition of the OOD generalization problem. Secondly, existing methods are categorized into three parts based on their positions in the whole learning pipeline, namely unsupervised representation learning, supervised model learning, and optimization, and typical methods for each category are discussed in detail. We then demonstrate the theoretical connections between the different categories, and introduce the commonly used datasets and evaluation metrics. Finally, we summarize the literature and raise some future directions for the OOD generalization problem. The summary of OOD generalization methods reviewed in this survey can be found at //out-of-distribution-generalization.com.