A Lorenz curve is a graphical representation of the distribution of income or wealth within a population. The generalized Lorenz curve is obtained by scaling the values on the vertical axis of a Lorenz curve by the mean of the distribution. In this paper, we propose two non-parametric methods for testing the equality of two generalized Lorenz curves. Both methods are based on empirical likelihood and utilize a U-statistic. We derive the limiting distribution of the likelihood ratio, which is shown to follow a chi-squared distribution with one degree of freedom. We perform simulations to evaluate how the proposed methods compare with an existing method, examining their Type I error rates and power across different sample sizes and distributional assumptions. Our results show that the proposed methods exhibit superior finite-sample performance, particularly for small sample sizes, and are robust across various scenarios. Finally, we use real-world data to illustrate the proposed tests for the equality of two generalized Lorenz curves.
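As a minimal illustration of the construction described above (not code from the paper), the empirical generalized Lorenz curve can be computed by sorting the sample, forming the ordinary Lorenz ordinates, and scaling them by the sample mean; the function name and simulated data below are purely illustrative.

```python
import numpy as np

def generalized_lorenz_curve(x):
    """Empirical Lorenz ordinates scaled by the sample mean.

    Returns (p, gl) where p[i] = i/n and gl[i] is the ordinary Lorenz
    ordinate (share of the total held by the smallest i observations)
    multiplied by the sample mean.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    p = np.arange(1, n + 1) / n            # cumulative population shares
    cum = np.cumsum(x)                     # cumulative totals
    lorenz = cum / cum[-1]                 # ordinary Lorenz ordinates
    gl = lorenz * x.mean()                 # scale the vertical axis by the mean
    return p, gl

# Illustrative use on simulated "income" data
rng = np.random.default_rng(0)
p, gl = generalized_lorenz_curve(rng.lognormal(mean=0.0, sigma=1.0, size=500))
```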
Hawkes processes are a class of self-exciting point processes used to model complex phenomena. While most applications of Hawkes processes assume that events occur in continuous time, the less-studied discrete-time version of the process is more appropriate in some situations. In this work, we develop methodology for the efficient implementation of discrete-time Hawkes processes. We achieve this by developing efficient algorithms to evaluate the log-likelihood function and its gradient, whose computational complexity is linear in the number of events. We extend these methods to a particular form of multivariate marked discrete-time Hawkes process, which we use to model the occurrences of violent events within a forensic psychiatric hospital. A prominent feature of our problem, captured by a mark in our process, is the presence of an alarm system which can be heard throughout the hospital. An alarm is sounded when an event is particularly violent in nature and warrants a call for assistance from other members of staff. We conduct a detailed analysis showing that this variant of the Hawkes process outperforms alternative models in terms of predictive power. Finally, we interpret our findings and describe their implications.
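To give a concrete sense of how a discrete-time Hawkes log-likelihood can be evaluated without a quadratic sum over the past, the sketch below assumes a univariate, unmarked model with a geometric (exponential-decay) excitation kernel, for which a one-step recursion keeps the cost linear in the number of time bins; the paper's exact formulation, marks, multivariate extension, and event-level linearity are not reproduced here.

```python
import numpy as np
from scipy.special import gammaln

def discrete_hawkes_loglik(counts, mu, alpha, beta):
    """Poisson log-likelihood of a discrete-time Hawkes process with
    intensity lambda_t = mu + alpha * sum_{s<t} beta**(t-s) * y_s.

    The running sum A_{t+1} = beta * (A_t + y_t) replaces the explicit sum
    over past bins, so the evaluation is linear in the number of time bins.
    """
    counts = np.asarray(counts, dtype=float)
    loglik, excitation = 0.0, 0.0
    for y in counts:
        lam = mu + alpha * excitation
        loglik += y * np.log(lam) - lam - gammaln(y + 1.0)
        excitation = beta * (excitation + y)   # O(1) kernel update
    return loglik
```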
The identification of the dependent components in multiple data sets is a fundamental problem in many practical applications. The challenge in these applications is that the data sets are often high-dimensional, with few observations or available samples, and contain latent components with unknown probability distributions. A novel mathematical formulation of this problem is proposed, which enables the inference of the underlying correlation structure with strict false positive control. In particular, the false discovery rate is controlled at a pre-defined threshold on two levels simultaneously. The deployed test statistics are derived from the sample coherence matrix. The required probability models are learned from the data using the bootstrap. Local false discovery rates are used to solve the multiple hypothesis testing problem. Compared to existing techniques in the literature, the developed technique does not assume an a priori correlation structure and works well when the number of data sets is large while the number of observations is small. In addition, it can handle the presence of distributional uncertainties, heavy-tailed noise, and outliers.
Recently, score-based generative models have been successfully employed for the task of speech enhancement. A stochastic differential equation is used to model the iterative forward process, where at each step environmental noise and white Gaussian noise are added to the clean speech signal. While in the limit the mean of the forward process converges to the noisy mixture, in practice the process is stopped earlier and therefore ends only at an approximation of the noisy mixture. This results in a discrepancy between the terminating distribution of the forward process and the prior used for solving the reverse process at inference. In this paper, we address this discrepancy and propose a forward process based on a Brownian bridge. We show that such a process leads to a reduction of the mismatch compared to previous diffusion processes. More importantly, we show that our approach improves on the baseline process in objective metrics, while requiring only half as many iteration steps and one fewer hyperparameter to tune.
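The sketch below illustrates one common Brownian-bridge parameterization of the forward marginal, pinned at the clean signal at t = 0 and exactly at the noisy mixture at t = T, so that the terminal distribution matches the inference prior by construction; the function name, parameter names, and the specific variance schedule are illustrative assumptions, not the paper's exact SDE.

```python
import numpy as np

def brownian_bridge_marginal(x0, y, t, T=1.0, sigma=0.5, rng=None):
    """Sample x_t from a Brownian-bridge forward process between the clean
    signal x0 (at t = 0) and the noisy mixture y (exactly, at t = T).

    Marginal: mean (1 - t/T) * x0 + (t/T) * y and variance
    sigma**2 * t * (T - t) / T, which vanishes at both endpoints, so there is
    no mismatch between the terminal distribution and the inference prior.
    """
    rng = np.random.default_rng() if rng is None else rng
    x0, y = np.asarray(x0, dtype=float), np.asarray(y, dtype=float)
    mean = (1.0 - t / T) * x0 + (t / T) * y
    std = sigma * np.sqrt(t * (T - t) / T)
    return mean + std * rng.standard_normal(x0.shape)
```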
We develop a statistical inference method applicable to a broad range of generalized linear models (GLMs) in high-dimensional settings, where the number of unknown coefficients scales proportionally with the sample size. Although a pioneering method has been developed for logistic regression, which is a specific instance of GLMs, its direct applicability to other GLMs remains limited. In this study, we address this limitation by developing a new inference method designed for a class of GLMs with asymmetric link functions. More precisely, we first introduce a novel convex loss-based estimator and its associated system, which are essential components for the inference. We then devise a methodology for identifying the parameters of this system that the inference procedure requires. Consequently, we construct confidence intervals for GLMs in the high-dimensional regime. We prove that our proposal has desirable theoretical properties, such as strong consistency and exact coverage probability. Finally, we confirm its validity in numerical experiments.
Independent Component Analysis (ICA) aims to recover independent latent variables from observed mixtures thereof. Causal Representation Learning (CRL) aims instead to infer causally related (thus often statistically dependent) latent variables, together with the unknown graph encoding their causal relationships. We introduce an intermediate problem termed Causal Component Analysis (CauCA). CauCA can be viewed as a generalization of ICA, modelling the causal dependence among the latent components, and as a special case of CRL. In contrast to CRL, it presupposes knowledge of the causal graph, focusing solely on learning the unmixing function and the causal mechanisms. Any impossibility results regarding the recovery of the ground truth in CauCA also apply to CRL, while possibility results may serve as a stepping stone for extensions to CRL. We characterize CauCA identifiability from multiple datasets generated through different types of interventions on the latent causal variables. As a corollary, this interventional perspective also leads to new identifiability results for nonlinear ICA -- a special case of CauCA with an empty graph -- requiring strictly fewer datasets than previous results. We introduce a likelihood-based approach using normalizing flows to estimate both the unmixing function and the causal mechanisms, and demonstrate its effectiveness through extensive synthetic experiments in the CauCA and ICA settings.
When factorized approximations are used for variational inference (VI), they tend to underestimate the uncertainty -- as measured in various ways -- of the distributions they are meant to approximate. We consider two popular ways to measure the uncertainty deficit of VI: (i) the degree to which it underestimates the componentwise variance, and (ii) the degree to which it underestimates the entropy. To better understand these effects, and the relationship between them, we examine an informative setting where they can be explicitly (and elegantly) analyzed: the approximation of a Gaussian,~$p$, with a dense covariance matrix, by a Gaussian,~$q$, with a diagonal covariance matrix. We prove that $q$ always underestimates both the componentwise variance and the entropy of $p$, \textit{though not necessarily to the same degree}. Moreover we demonstrate that the entropy of $q$ is determined by the trade-off of two competing forces: it is decreased by the shrinkage of its componentwise variances (our first measure of uncertainty) but it is increased by the factorized approximation which delinks the nodes in the graphical model of $p$. We study various manifestations of this trade-off, notably one where, as the dimension of the problem grows, the per-component entropy gap between $p$ and $q$ becomes vanishingly small even though $q$ underestimates every componentwise variance by a constant multiplicative factor. We also use the shrinkage-delinkage trade-off to bound the entropy gap in terms of the problem dimension and the condition number of the correlation matrix of $p$. Finally we present empirical results on both Gaussian and non-Gaussian targets, the former to validate our analysis and the latter to explore its limitations.
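The Gaussian setting above is easy to reproduce numerically: for a zero-mean Gaussian target with dense covariance, the reverse-KL-optimal factorized Gaussian has variances equal to the reciprocal of the diagonal of the precision matrix, which makes both the componentwise variance shrinkage and the entropy gap directly inspectable. The sketch below uses this standard closed form and is illustrative only, not the paper's code.

```python
import numpy as np

def factorized_gaussian_vi(Sigma):
    """Reverse-KL factorized Gaussian approximation of p = N(0, Sigma).

    The KL(q || p)-optimal diagonal q has variances 1 / diag(Sigma^{-1}),
    which never exceed the true marginal variances diag(Sigma); likewise the
    entropy of q never exceeds that of p.
    """
    Lambda = np.linalg.inv(Sigma)                 # precision matrix of p
    q_var = 1.0 / np.diag(Lambda)                 # optimal factorized variances
    d = Sigma.shape[0]
    entropy_p = 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])
    entropy_q = 0.5 * (d * np.log(2 * np.pi * np.e) + np.sum(np.log(q_var)))
    return q_var, np.diag(Sigma), entropy_p - entropy_q

# Example: q's variances are shrunk relative to p's, and the entropy gap is >= 0
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
q_var, p_var, entropy_gap = factorized_gaussian_vi(Sigma)
print(q_var, p_var, entropy_gap)
```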
In Bayesian inference, the approximation of integrals of the form $\psi = \mathbb{E}_{F}[l(X)] = \int_{\mathcal{X}} l(\mathbf{x}) \, dF(\mathbf{x})$ is a fundamental challenge. Such integrals are crucial for evidence estimation, which is important for various purposes, including model selection and numerical analysis. The existing strategies for evidence estimation are classified into four categories: deterministic approximation, density estimation, importance sampling, and vertical representation (Llorente et al., 2020). In this paper, we show that the Riemann sum estimator due to Yakowitz (1978) can be used in the context of nested sampling (Skilling, 2006) to achieve an $O(n^{-4})$ rate of convergence, faster than the rate implied by the usual ergodic central limit theorem. We provide a brief overview of the literature on Riemann sum estimators and on the nested sampling algorithm and its connections to vertical likelihood Monte Carlo. We provide theoretical and numerical arguments to show how merging these two ideas may result in improved and more robust estimators for evidence estimation, especially in higher-dimensional spaces. We also briefly discuss the idea of simulating the Lorenz curve, which avoids the problem of intractable $\Lambda$ functions that are essential for the vertical representation and nested sampling.
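For concreteness, the sketch below implements a plain textbook nested sampling loop in which the prior volumes are fixed at their expected values $X_i = e^{-i/N}$ and the evidence is accumulated as a Riemann sum over likelihood-ordered shells; the constrained-prior draws use naive rejection, and nothing here reflects the refined estimators or the convergence analysis developed in the paper.

```python
import numpy as np

def nested_sampling_evidence(log_likelihood, sample_prior, n_live=500,
                             n_iter=5000, rng=None):
    """Crude nested-sampling estimate of the evidence Z = E_F[l(X)].

    `sample_prior(k, rng)` must return an array of k prior draws. Prior
    volumes are taken at their expected values exp(-i / n_live) and the
    one-dimensional integral of likelihood over prior volume is approximated
    by a Riemann sum. Dead points are replaced by rejection sampling from the
    prior subject to the likelihood constraint, which is only viable for
    easy, low-dimensional targets.
    """
    rng = np.random.default_rng() if rng is None else rng
    live = np.asarray(sample_prior(n_live, rng))
    live_logl = np.array([log_likelihood(x) for x in live])
    log_X_prev, Z = 0.0, 0.0
    for i in range(1, n_iter + 1):
        worst = np.argmin(live_logl)
        log_X = -i / n_live                         # expected log prior volume
        Z += np.exp(live_logl[worst]) * (np.exp(log_X_prev) - np.exp(log_X))
        log_X_prev = log_X
        threshold = live_logl[worst]
        while True:                                 # constrained prior draw
            candidate = np.asarray(sample_prior(1, rng))[0]
            cand_logl = log_likelihood(candidate)
            if cand_logl > threshold:
                live[worst], live_logl[worst] = candidate, cand_logl
                break
    return Z
```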
Computational methods for statistical measures of reliability, confidence, and assurance are available for an infinite population size. If the population size is finite and not large compared to the number of samples tested, these computational methods need to be refined to better represent reality. This article discusses how to compute reliability, confidence, and assurance statistics for a finite number of samples drawn from a finite population. Graphs and tables are provided as examples and can be used for small test sample sizes. Two open-source Python libraries are provided for computing reliability, confidence, and assurance with both infinite and finite population sizes.
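As a hedged illustration of the two regimes discussed above, the sketch below computes zero-failure confidence with the familiar success-run formula C = 1 - R^n for an infinite population, and with a hypergeometric analogue for a finite population. This is one standard convention; the conventions and interfaces used by the article's libraries may differ.

```python
from math import comb

def confidence_infinite(reliability, n):
    """Success-run confidence for zero failures in n tests, infinite
    population: C = 1 - R**n."""
    return 1.0 - reliability ** n

def confidence_finite(reliability, n, population):
    """Hypergeometric analogue for a finite population: confidence that at
    most floor((1 - R) * N) units are defective, given a clean sample of n
    units drawn without replacement from N."""
    max_defects = int((1.0 - reliability) * population)  # allowed defectives
    d = max_defects + 1                                  # smallest violating count
    if d > population - n:
        return 1.0   # a clean sample of size n rules out d defectives entirely
    return 1.0 - comb(population - d, n) / comb(population, n)

# Example: 22 clean tests give ~90% confidence of 90% reliability (infinite case);
# the finite-population value is higher because the sample exhausts more of N.
print(confidence_infinite(0.90, 22))
print(confidence_finite(0.90, 22, 100))
```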
Empirical likelihood is a powerful nonparametric tool that emulates its parametric counterpart -- the parametric likelihood -- while preserving many of its large-sample properties. This article tackles the problem of assessing the discriminatory power of three-class diagnostic tests from an empirical likelihood perspective. In particular, we concentrate on interval estimation in a three-class ROC analysis, where a variety of inferential tasks could be of interest. We present novel theoretical results and tailored techniques designed to efficiently solve some of these tasks. Extensive simulation experiments are provided in a supporting role, with our novel proposals compared to existing competitors where possible. It emerges that our new proposals are extremely flexible, able to compete with existing methods and best suited to accommodating flexible distributions for the target populations. We illustrate the application of the novel proposals with a real-data example. The article ends with a discussion and a presentation of some directions for future research.
Causal inference has been a critical research topic for decades across many domains, such as statistics, computer science, education, public policy, and economics. Nowadays, estimating causal effects from observational data has become an appealing research direction owing to the large amount of available data and the low budget requirement compared with randomized controlled trials. Spurred by the rapid development of machine learning, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the most widely used causal inference frameworks. The methods are divided into two categories depending on whether or not they require all three assumptions of the potential outcome framework. For each category, both the traditional statistical methods and the recent machine-learning-enhanced methods are discussed and compared. The plausible applications of these methods are also presented, including applications in advertising, recommendation, medicine, and so on. Moreover, the commonly used benchmark datasets as well as the open-source codes are also summarized, which can help researchers and practitioners explore, evaluate, and apply these causal inference methods.