In many applications in biology, engineering and economics, identifying similarities and differences between distributions of data from complex processes requires comparing finite categorical samples of discrete counts. Statistical divergences quantify the difference between two distributions. However, their estimation is very difficult and empirical methods often fail, especially when the samples are small. We develop a Bayesian estimator of the Kullback-Leibler divergence between two probability distributions that makes use of a mixture of Dirichlet priors on the distributions being compared. We study the properties of the estimator on two examples: probabilities drawn from Dirichlet distributions, and random strings of letters drawn from Markov chains. We extend the approach to the squared Hellinger divergence. Both estimators outperform other estimation techniques, with better results for data with a large number of categories and for higher values of divergences.
We extend several relative perturbation bounds to Hermitian matrices that are possibly singular, and also develop a general class of relative bounds for Hermitian matrices. As a result, corresponding relative bounds for singular values of rank-deficient $m\times n$ matrices are also obtained using the Jordan-Wielandt matrices. We also present that the main relative bound derived would be invariant with respect to congruence transformation under certain conditions, and compare its sharpness with the Weyl's absolute perturbation bound.
Recently, several algorithms have been proposed for decomposing reactive synthesis specifications into independent and simpler sub-specifications. Being inspired by one of the approaches, developed by Antonio Iannopollo (2018), who designed the so-called (DC) algorithm, we present here our solution that takes his ideas further and provides mathematical formalisation of the strategy behind DC. We rigorously define the main notions involved in the algorithm, explain the technique, and demonstrate its application on examples. The core technique of DC is based on the detection of independent variables in linear temporal logic formulae by exploiting the power and efficiency of a model checker. Although the DC algorithm is sound, it is not complete, as its author already pointed out. In this paper, we provide a counterexample that shows this fact and propose relevant changes to adapt the original DC strategy to ensure its correctness. The modification of DC and the detailed proof of its soundness and completeness are the main contributions of this work.
Non-autoregressive approaches aim to improve the inference speed of translation models, particularly those that generate output in a one-pass forward manner. However, these approaches often suffer from a significant drop in translation quality compared to autoregressive models. This paper introduces a series of innovative techniques to enhance the translation quality of Non-Autoregressive Translation (NAT) models while maintaining a substantial acceleration in inference speed. We propose fine-tuning Pretrained Multilingual Language Models (PMLMs) with the CTC loss to train NAT models effectively. Furthermore, we adopt the MASK insertion scheme for up-sampling instead of token duplication, and we present an embedding distillation method to further enhance performance. In our experiments, our model outperforms the baseline autoregressive model (Transformer \textit{base}) on multiple datasets, including WMT'14 DE$\leftrightarrow$EN, WMT'16 RO$\leftrightarrow$EN, and IWSLT'14 DE$\leftrightarrow$EN. Notably, our model achieves better performance than the baseline autoregressive model on the IWSLT'14 En$\leftrightarrow$De and WMT'16 En$\leftrightarrow$Ro datasets, even without using distillation data during training. It is worth highlighting that on the IWSLT'14 DE$\rightarrow$EN dataset, our model achieves an impressive BLEU score of 39.59, setting a new state-of-the-art performance. Additionally, our model exhibits a remarkable speed improvement of 16.35 times compared to the autoregressive model.
The joint retrieval of surface reflectances and atmospheric parameters in VSWIR imaging spectroscopy is a computationally challenging high-dimensional problem. Using NASA's Surface Biology and Geology mission as the motivational context, the uncertainty associated with the retrievals is crucial for further application of the retrieved results for environmental applications. Although Markov chain Monte Carlo (MCMC) is a Bayesian method ideal for uncertainty quantification, the full-dimensional implementation of MCMC for the retrieval is computationally intractable. In this work, we developed a block Metropolis MCMC algorithm for the high-dimensional VSWIR surface reflectance retrieval that leverages the structure of the forward radiative transfer model to enable tractable fully Bayesian computation. We use the posterior distribution from this MCMC algorithm to assess the limitations of optimal estimation, the state-of-the-art Bayesian algorithm in operational retrievals which is more computationally efficient but uses a Gaussian approximation to characterize the posterior. Analyzing the differences in the posterior computed by each method, the MCMC algorithm was shown to give more physically sensible results and reveals the non-Gaussian structure of the posterior, specifically in the atmospheric aerosol optical depth parameter and the low-wavelength surface reflectances.
Whether or not the Kronecker coefficients of the symmetric group count some set of combinatorial objects is a longstanding open question. In this work we show that a given Kronecker coefficient is proportional to the rank of a projector that can be measured efficiently using a quantum computer. In other words a Kronecker coefficient counts the dimension of the vector space spanned by the accepting witnesses of a QMA verifier, where QMA is the quantum analogue of NP. This implies that approximating the Kronecker coefficients to within a given relative error is not harder than a certain natural class of quantum approximate counting problems that captures the complexity of estimating thermal properties of quantum many-body systems. A second consequence is that deciding positivity of Kronecker coefficients is contained in QMA, complementing a recent NP-hardness result of Ikenmeyer, Mulmuley and Walter. We obtain similar results for the related problem of approximating row sums of the character table of the symmetric group. Finally, we discuss an efficient quantum algorithm that approximates normalized Kronecker coefficients to inverse-polynomial additive error.
Enumeration problems are often encountered as key subroutines in the exact computation of graph parameters such as chromatic number, treewidth, or treedepth. In the case of treedepth computation, the enumeration of inclusion-wise minimal separators plays a crucial role. However and quite surprisingly, the complexity status of this problem has not been settled since it has been posed as an open direction by Kloks and Kratsch in 1998. Recently at the PACE 2020 competition dedicated to treedepth computation, solvers have been circumventing that by listing all minimal $a$-$b$ separators and filtering out those that are not inclusion-wise minimal, at the cost of efficiency. Naturally, having an efficient algorithm for listing inclusion-wise minimal separators would drastically improve such practical algorithms. In this note, however, we show that no efficient algorithm is to be expected from an output-sensitive perspective, namely, we prove that there is no output-polynomial time algorithm for inclusion-wise minimal separators enumeration unless P = NP.
Statistical techniques are needed to analyse data structures with complex dependencies such that clinically useful information can be extracted. Individual-specific networks, which capture dependencies in complex biological systems, are often summarized by graph-theoretical features. These features, which lend themselves to outcome modelling, can be subject to high variability due to arbitrary decisions in network inference and noise. Correlation-based adjacency matrices often need to be sparsified before meaningful graph-theoretical features can be extracted, requiring the data analysts to determine an optimal threshold.. To address this issue, we propose to incorporate a flexible weighting function over the full range of possible thresholds to capture the variability of graph-theoretical features over the threshold domain. The potential of this approach, which extends concepts from functional data analysis to a graph-theoretical setting, is explored in a plasmode simulation study using real functional magnetic resonance imaging (fMRI) data from the Autism Brain Imaging Data Exchange (ABIDE) Preprocessed initiative. The simulations show that our modelling approach yields accurate estimates of the functional form of the weight function, improves inference efficiency, and achieves a comparable or reduced root mean square prediction error compared to competitor modelling approaches. This assertion holds true in settings where both complex functional forms underlie the outcome-generating process and a universal threshold value is employed. We demonstrate the practical utility of our approach by using resting-state fMRI data to predict biological age in children. Our study establishes the flexible modelling approach as a statistically principled, serious competitor to ad-hoc methods with superior performance.
We propose and analyse an explicit boundary-preserving scheme for the strong approximations of some SDEs with non-globally Lipschitz drift and diffusion coefficients whose state-space is bounded. The scheme consists of a Lamperti transform followed by a Lie--Trotter splitting. We prove $L^{p}(\Omega)$-convergence of order $1$, for every $p \in \mathbb{N}$, of the scheme and exploit the Lamperti transform to confine the numerical approximations to the state-space of the considered SDE. We provide numerical experiments that confirm the theoretical results and compare the proposed Lamperti-splitting scheme to other numerical schemes for SDEs.
By a semi-Lagrangian change of coordinates, the hydrostatic Euler equations describing free-surface sheared flows is rewritten as a system of quasilinear equations, where stability conditions can be determined by the analysis of its hyperbolic structure. This new system can be written as a quasi linear system in time and horizontal variables and involves no more vertical derivatives. However, the coefficients in front of the horizontal derivatives include an integral operator acting on the new vertical variable. The spectrum of these operators is studied in detail, in particular it includes a continuous part. Riemann invariants are then determined as conserved quantities along the characteristic curves. Examples of solutions are provided, in particular stationary solutions and solutions blowing-up in finite time. Eventually, we propose an exact multi-layer $\mathbb{P}_0$-discretization, which could be used to solve numerically this semi-Lagrangian system, and analyze the eigenvalues of the corresponding discretized operator to investigate the hyperbolic nature of the approximated system.
In epidemiology and social sciences, propensity score methods are popular for estimating treatment effects using observational data, and multiple imputation is popular for handling covariate missingness. However, how to appropriately use multiple imputation for propensity score analysis is not completely clear. This paper aims to bring clarity on the consistency (or lack thereof) of methods that have been proposed, focusing on the within approach (where the effect is estimated separately in each imputed dataset and then the multiple estimates are combined) and the across approach (where typically propensity scores are averaged across imputed datasets before being used for effect estimation). We show that the within method is valid and can be used with any causal effect estimator that is consistent in the full-data setting. Existing across methods are inconsistent, but a different across method that averages the inverse probability weights across imputed datasets is consistent for propensity score weighting. We also comment on methods that rely on imputing a function of the missing covariate rather than the covariate itself, including imputation of the propensity score and of the probability weight. Based on consistency results and practical flexibility, we recommend generally using the standard within method. Throughout, we provide intuition to make the results meaningful to the broad audience of applied researchers.