In many applications, a stochastic system is studied using a model implicitly defined via a simulator. We develop a simulation-based parameter inference method for implicitly defined models. Our method differs from traditional likelihood-based inference in that it uses a metamodel for the distribution of a log-likelihood estimator. The metamodel is built on a local asymptotic normality (LAN) property satisfied by the simulation-based log-likelihood estimator under certain conditions. A method for hypothesis testing is developed under the metamodel. Our method enables accurate parameter estimation and uncertainty quantification where other Monte Carlo methods for parameter inference become highly inefficient due to large Monte Carlo variance. We demonstrate our method using numerical examples, including a mechanistic model for the population dynamics of infectious disease.
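To make the metamodel idea concrete, the following is a minimal sketch assuming a one-dimensional parameter and a toy simulator: noisy log-likelihood estimates are generated on a grid, a quadratic surface (the local form suggested by LAN) is fitted by least squares, and a point estimate with a curvature-based confidence interval is read off. The simulator and all settings are hypothetical; this is not the paper's implementation.

```python
# Minimal sketch of fitting a quadratic metamodel to noisy simulated
# log-likelihood estimates, as suggested by the LAN property (toy
# simulator and invented settings; not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)

def simulated_loglik(theta, n_rep=1):
    """Stand-in for a simulation-based log-likelihood estimator: the true
    surface is quadratic around theta* = 1.0, observed with Monte Carlo noise."""
    true = -50.0 * (theta - 1.0) ** 2
    return true + rng.normal(scale=2.0, size=n_rep).mean()

# Evaluate the noisy estimator on a grid of parameter values.
thetas = np.linspace(0.5, 1.5, 41)
ells = np.array([simulated_loglik(t, n_rep=5) for t in thetas])

# Fit the quadratic metamodel ell(theta) ~ a*theta^2 + b*theta + c.
a, b, c = np.polyfit(thetas, ells, deg=2)

theta_hat = -b / (2 * a)            # metamodel-based point estimate
curvature = -2 * a                  # proxy for observed Fisher information
se = 1.0 / np.sqrt(curvature)       # crude standard error from the curvature
print(f"estimate {theta_hat:.3f}, approx 95% CI "
      f"({theta_hat - 1.96 * se:.3f}, {theta_hat + 1.96 * se:.3f})")
```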
Utilizing massive web-scale datasets has led to unprecedented performance gains in machine learning models, but also imposes outlandish compute requirements for their training. In order to improve training and data efficiency, here we push the limits of pruning large-scale multimodal datasets for training CLIP-style models. Today's most effective pruning method on ImageNet clusters data samples into separate concepts according to their embedding and prunes away the most prototypical samples. We scale this approach to LAION and improve it by noting that the pruning rate should be concept-specific and adapted to the complexity of the concept. Using a simple and intuitive complexity measure, we are able to reduce the training cost to a quarter of regular training. By filtering the LAION dataset, we find that training on a smaller set of high-quality data can lead to higher performance at significantly lower training cost. More specifically, we are able to outperform the LAION-trained OpenCLIP-ViT-B32 model on ImageNet zero-shot accuracy by 1.1 percentage points while using only 27.7% of the data and training compute. Despite this strong reduction in training cost, we also see improvements on ImageNet distribution shifts, retrieval tasks, and VTAB. On the DataComp Medium benchmark, we achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy across 38 evaluation tasks.
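As a rough illustration of concept-specific pruning, the sketch below clusters (placeholder) embeddings into concepts, uses within-cluster dispersion as a stand-in complexity measure, and prunes the most prototypical samples more aggressively in low-complexity clusters. The complexity measure and the rate mapping are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch of concept-aware pruning: cluster embeddings into
# "concepts", score each cluster's complexity by its embedding dispersion,
# and prune the most prototypical samples more aggressively in simple
# clusters. The complexity measure here is a stand-in, not the paper's.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.normal(size=(10_000, 64))          # placeholder CLIP embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

k = 50
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
dist = np.linalg.norm(emb - km.cluster_centers_[km.labels_], axis=1)

base_keep_rate = 0.25                        # target roughly a quarter of the data
complexity = np.array([dist[km.labels_ == c].mean() for c in range(k)])
# Map complexity to a per-concept keep rate: complex concepts keep more.
rates = np.clip(base_keep_rate * complexity / complexity.mean(), 0.05, 1.0)

keep = []
for c in range(k):
    idx = np.where(km.labels_ == c)[0]
    n_keep = max(1, int(rates[c] * len(idx)))
    # Drop the most prototypical samples (smallest distance to centroid).
    keep.extend(idx[np.argsort(dist[idx])[-n_keep:]])

print(f"kept {len(keep)} of {len(emb)} samples")
```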
Within Bayesian nonparametrics, dependent Dirichlet process mixture models provide a highly flexible approach for conducting inference about the conditional density function. However, several formulations of this class make either rather restrictive modelling assumptions or involve intricate algorithms for posterior inference, thus preventing their widespread use. In response to these challenges, we present a flexible, versatile, and computationally tractable model for density regression based on a single-weights dependent Dirichlet process mixture of normal distributions model for univariate continuous responses. We assume an additive structure for the mean of each mixture component and incorporate the effects of continuous covariates through smooth nonlinear functions. The key components of our modelling approach are penalised B-splines and their bivariate tensor product extension. Our proposed method also seamlessly accommodates parametric effects of categorical covariates, linear effects of continuous covariates, interactions between categorical and/or continuous covariates, varying coefficient terms, and random effects, which is why we refer to our model as a Dirichlet process mixture of normal structured additive regression models. A noteworthy feature of our method is its efficiency in posterior simulation through Gibbs sampling, as closed-form full conditional distributions for all model parameters are available. Results from a simulation study demonstrate that our approach successfully recovers true conditional densities and other regression functionals in various challenging scenarios. Applications to toxicology, disease diagnosis, and agricultural studies further underpin the broad applicability of our modelling framework. An R package, \texttt{DDPstar}, implementing the proposed method is publicly available at \url{https://bitbucket.org/mxrodriguez/ddpstar}.
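The penalised B-spline building block can be sketched in a few lines: a cubic B-spline design matrix with a second-order difference penalty on the coefficients, fitted by penalised least squares. This illustrates only the smoother itself; the paper embeds it inside a dependent Dirichlet process mixture with Gibbs sampling.

```python
# Minimal sketch of the penalised B-spline (P-spline) building block:
# a cubic B-spline design matrix with a second-order difference penalty,
# fitted by penalised least squares (illustration only).
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Cubic B-spline basis on equally spaced interior knots.
degree, n_inner = 3, 20
knots = np.concatenate([
    np.repeat(0.0, degree + 1),
    np.linspace(0, 1, n_inner + 2)[1:-1],
    np.repeat(1.0, degree + 1),
])
B = BSpline.design_matrix(x, knots, degree).toarray()

# Second-order difference penalty D'D on the spline coefficients.
n_coef = B.shape[1]
D = np.diff(np.eye(n_coef), n=2, axis=0)
lam = 1.0                                    # smoothing parameter
coef = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
fitted = B @ coef                            # smooth estimate of E[y | x]
```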
One tuple of probability vectors is more informative than another tuple when there exists a single stochastic matrix transforming the probability vectors of the first tuple into the probability vectors of the other. This relation is called matrix majorization. Solving an open problem raised by Mu et al., we show that if certain monotones, namely multivariate extensions of R\'{e}nyi divergences, are strictly ordered between the two tuples, then for sufficiently large $n$, there exists a stochastic matrix taking the $n$-fold Kronecker power of each input distribution to the $n$-fold Kronecker power of the corresponding output distribution. The same conditions, with non-strict ordering for the monotones, are also necessary for such matrix majorization in large samples. Our result also gives conditions for the existence of a sequence of statistical maps that asymptotically (with vanishing error) convert a single copy of each input distribution to the corresponding output distribution with the help of a catalyst that is returned unchanged. Allowing for transformation with arbitrarily small error, we find conditions that are both necessary and sufficient for such catalytic matrix majorization. We derive our results by building on a general algebraic theory of preordered semirings recently developed by one of the authors. This also allows us to recover various existing results on majorization in large samples and in the catalytic regime, as well as relative majorization, in a unified manner.
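The single-copy matrix majorization of the first sentence can be checked directly as a linear-programming feasibility problem: does one column-stochastic matrix $T$ map every $p_i$ to the matching $q_i$? A minimal sketch follows, illustrating the definition only; the paper's results concern the large-sample and catalytic regimes.

```python
# Direct feasibility check for (single-copy) matrix majorization:
# does one column-stochastic matrix T map every p_i to the matching q_i?
# Solved as a linear programme (illustration of the definition only).
import numpy as np
from scipy.optimize import linprog

def matrix_majorizes(ps, qs):
    """ps, qs: lists of probability vectors of dimensions a and b."""
    a, b = len(ps[0]), len(qs[0])
    n_var = b * a                              # entries of T, row-major
    A_eq, b_eq = [], []
    # Constraints T p_i = q_i for each pair.
    for p, q in zip(ps, qs):
        for r in range(b):
            row = np.zeros(n_var)
            row[r * a:(r + 1) * a] = p
            A_eq.append(row); b_eq.append(q[r])
    # Columns of T sum to 1 (stochasticity).
    for j in range(a):
        row = np.zeros(n_var)
        row[j::a] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    res = linprog(c=np.zeros(n_var), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, 1)] * n_var, method="highs")
    return res.success

p = [np.array([0.8, 0.2]), np.array([0.2, 0.8])]
q = [np.array([0.7, 0.3]), np.array([0.3, 0.7])]
print(matrix_majorizes(p, q))   # True: adding noise is always possible
print(matrix_majorizes(q, p))   # False: the noise cannot be undone
```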
In designing external validation studies of clinical prediction models, contemporary sample size calculation methods are based on the frequentist inferential paradigm. One of the widely reported metrics of model performance is net benefit (NB), a measure of clinical utility for which the relevance of conventional inference is doubtful. Value of Information methodology quantifies the consequences of uncertainty in terms of its impact on the clinical utility of decisions. We introduce the expected value of sample information (EVSI) for validation as the expected gain in NB from conducting an external validation study of a given size. We propose algorithms for EVSI computation and, in a case study, demonstrate how EVSI changes as a function of the amount of current information and the future study's sample size. Value of Information methodology provides a decision-theoretic lens on the process of planning a validation study of a risk prediction model and can complement conventional methods when designing such studies.
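A hedged Monte Carlo sketch of the EVSI computation is given below, under invented assumptions: Beta priors on the model's sensitivity and specificity at a fixed threshold, known prevalence, and a future validation study of size $n$ updating these priors conjugately. The paper's algorithms are more general; this only shows the nested-expectation structure of EVSI for validation.

```python
# Hedged Monte Carlo sketch of EVSI for validation: expected gain in net
# benefit (NB) from a future external validation study of size n. Priors,
# threshold, and prevalence are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
pt, prev, w = 0.2, 0.3, 0.2 / 0.8            # threshold, prevalence, odds weight
a_se, b_se, a_sp, b_sp = 16, 4, 12, 8        # Beta priors on sens and spec

def nb_options(se, sp):
    """NB of using the model, treating all, or treating none."""
    nb_model = se * prev - (1 - sp) * (1 - prev) * w
    nb_all = prev - (1 - prev) * w
    return np.array([nb_model, nb_all, 0.0])

# Current (prior) clinical utility: NB is linear in (se, sp), so expected
# NB equals NB at the prior means; pick the best option.
utility_now = nb_options(a_se / (a_se + b_se), a_sp / (a_sp + b_sp)).max()

def evsi(n, n_sim=20_000):
    gain = 0.0
    for _ in range(n_sim):
        se = rng.beta(a_se, b_se); sp = rng.beta(a_sp, b_sp)
        nd = rng.binomial(n, prev)            # diseased in the future study
        tp = rng.binomial(nd, se); tn = rng.binomial(n - nd, sp)
        # Conjugate posterior means after observing the future study.
        se_post = (a_se + tp) / (a_se + b_se + nd)
        sp_post = (a_sp + tn) / (a_sp + b_sp + n - nd)
        gain += nb_options(se_post, sp_post).max()
    return gain / n_sim - utility_now

for n in (100, 500, 2000):
    print(n, round(evsi(n), 5))
```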
In general, providing an axiomatization for an arbitrary logic is a task that may require some ingenuity. In the case of logics defined by a finite logical matrix (three-valued logics being a particularly simple example), the generation of suitable finite axiomatizations can be completely automated, essentially by expressing the matrix tables via inference rules. In this chapter we illustrate how two formalisms, the 3-labelled calculi of Baaz, Ferm\"uller and Zach and the multiple-conclusion (or Set-Set) Hilbert-style calculi of Shoesmith and Smiley, may be uniformly employed to axiomatize logics defined by a three-valued logical matrix. The generating procedure common to both formalisms can be described as follows: first (i) convert the matrix semantics into rule form (we refer to this step as the generating subprocedure) and then (ii) simplify the set of rules thus obtained, relying essentially on the defining properties of any Tarskian consequence relation (we refer to this step as the streamlining subprocedure). We illustrate through examples that, if a minimal expressiveness assumption is met (namely, if the matrix defining the logic is monadic), then it is straightforward to define effective translations guaranteeing the equivalence between the 3-labelled and the Set-Set approaches.
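A heavily simplified illustration of the generating subprocedure: each entry of a three-valued truth table is read off as a raw rule relating the values of the subformulas to the value of the compound formula. The actual 3-labelled and Set-Set calculi express such determinants as labelled or multiple-conclusion rules and then streamline them; the table below is the Łukasiewicz three-valued implication, chosen only as a familiar example.

```python
# Schematic illustration of the "generating subprocedure": each entry of a
# three-valued truth table is read off as a raw rule relating the values of
# the subformulas to the value of the compound formula. The real 3-labelled
# and Set-Set calculi refine these raw determinants and then streamline them.
VALUES = ("0", "i", "1")                     # 'i' denotes the third value

# Lukasiewicz three-valued implication, as a {0, i, 1} x {0, i, 1} table.
IMP = {("0", "0"): "1", ("0", "i"): "1", ("0", "1"): "1",
       ("i", "0"): "i", ("i", "i"): "1", ("i", "1"): "1",
       ("1", "0"): "0", ("1", "i"): "i", ("1", "1"): "1"}

def determinants(table, name):
    """Print one raw rule per cell of the truth table."""
    for (x, y), z in sorted(table.items()):
        print(f"v(A)={x}, v(B)={y}  =>  v(A {name} B)={z}")

determinants(IMP, "->")
```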
Spatial areal models encounter the well-known and challenging problem of spatial confounding. This issue makes it arduous to distinguish between the impacts of observed covariates and spatial random effects. Despite previous research and various proposed methods to tackle this problem, a definitive solution remains elusive. In this paper, we propose a simplified version of the spatial+ approach that involves dividing the covariate into two components. One component captures large-scale spatial dependence, while the other accounts for short-scale dependence. This approach eliminates the need to separately fit spatial models for the covariates. We apply this method to analyse two forms of crime against women, namely rapes and dowry deaths, in Uttar Pradesh, India, exploring their relationship with socio-demographic covariates. To evaluate the performance of the new approach, we conduct extensive simulation studies under different spatial confounding scenarios. The results demonstrate that the proposed method provides reliable estimates of fixed effects and posterior correlations between different responses.
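One hedged way to realize such a split for areal data, sketched below, is to project the covariate onto the smoothest eigenvectors of the graph Laplacian of the adjacency structure, taking the projection as the large-scale component and the residual as the short-scale one. The toy adjacency, the number of modes, and the projection itself are illustrative stand-ins for the paper's exact decomposition.

```python
# A hedged way to split an areal covariate into large- and short-scale
# parts: project it onto the smoothest eigenvectors of the graph Laplacian
# of the areal adjacency structure (a stand-in for the paper's exact split).
import numpy as np

rng = np.random.default_rng(0)

# Toy adjacency for n areas on a ring (placeholder for a real map).
n = 100
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1
L = np.diag(W.sum(axis=1)) - W               # graph Laplacian

x = rng.normal(size=n)                       # observed covariate
eigval, eigvec = np.linalg.eigh(L)           # ascending: smoothest modes first

k = 10                                       # number of large-scale modes
E = eigvec[:, :k]
x_large = E @ (E.T @ x)                      # large-scale spatial component
x_short = x - x_large                        # short-scale component
# x_large and x_short then enter the regression as separate terms, so no
# separate spatial model needs to be fitted for the covariate.
```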
Linear codes are widely studied in coding theory owing to their applications in distributed storage, combinatorics, lattices, cryptography, and beyond. Constructing linear codes with desirable properties is an interesting research topic. In this paper, based on the augmentation technique, we present two families of linear codes constructed from functions over finite fields. The first family is constructed from monomial functions over finite fields. Its locality is determined, and the weight distributions of two subfamilies of the codes are also given. An infinite family of locally recoverable codes which are at least almost optimal, as well as some optimal locally recoverable codes, are obtained from these linear codes. In particular, the two subfamilies of the codes are proved to be optimally or almost optimally extendable, as well as self-orthogonal. The second family of linear codes is constructed from weakly regular bent functions over finite fields, and its weight distribution is determined. This family of codes is proved to have locality 3 in some cases and is conjectured to have locality 2 in the others. In particular, two families of optimal locally recoverable codes are derived from these linear codes. This family of codes is also proved to be optimally or almost optimally extendable, as well as self-orthogonal.
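For readers who want to experiment, the weight distribution of a small linear code can be tallied by brute force from its generator matrix, as in the toy sketch below; the paper derives such distributions in closed form for far larger codes over general finite fields.

```python
# Brute-force sketch: enumerate all codewords of a small linear code over
# GF(p) from its generator matrix and tally the weight distribution (a toy
# version of the object whose closed form the paper derives).
import itertools
from collections import Counter

import numpy as np

def weight_distribution(G, p):
    k, n = G.shape
    dist = Counter()
    for msg in itertools.product(range(p), repeat=k):
        cw = (np.array(msg) @ G) % p
        dist[int(np.count_nonzero(cw))] += 1
    return dict(sorted(dist.items()))

# [7,4] Hamming code over GF(2): weights should be {0: 1, 3: 7, 4: 7, 7: 1}.
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
print(weight_distribution(G, 2))
```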
We introduce two iterative methods, GPBiLQ and GPQMR, for solving unsymmetric partitioned linear systems. The basic mechanism underlying GPBiLQ and GPQMR is a novel simultaneous tridiagonalization via biorthogonality that allows for short-recurrence iterative schemes. In analogy with the biconjugate gradient method, a third method, GPBiCG, can be developed whose iterate (if it exists) can be obtained inexpensively from the GPBiLQ iterate. Whereas the GPBiCG iterate may not exist, the iterates of GPBiLQ and GPQMR are always well defined as long as the biorthogonal tridiagonal reduction process does not break down. We discuss connections between the proposed methods and some existing methods, and present numerical experiments to illustrate the performance of the proposed methods.
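The classical two-sided Lanczos biorthogonalization, which reduces a matrix to tridiagonal form via a pair of biorthogonal bases, is the standard building block behind such short-recurrence schemes; a minimal sketch follows. The paper's simultaneous tridiagonalization for partitioned systems is a refinement of this idea and is not reproduced here.

```python
# Classical two-sided (unsymmetric) Lanczos biorthogonalization reducing A
# to tridiagonal form: the textbook building block behind short-recurrence
# methods, shown for context (not the paper's simultaneous reduction).
import numpy as np

def two_sided_lanczos(A, v1, w1, m):
    n = A.shape[0]
    V = np.zeros((n, m)); W = np.zeros((n, m))
    alpha = np.zeros(m); beta = np.zeros(m); delta = np.zeros(m)
    V[:, 0] = v1 / (w1 @ v1); W[:, 0] = w1   # enforce w1'v1 = 1
    for j in range(m):
        alpha[j] = W[:, j] @ A @ V[:, j]
        v_next = A @ V[:, j] - alpha[j] * V[:, j]
        w_next = A.T @ W[:, j] - alpha[j] * W[:, j]
        if j > 0:
            v_next -= beta[j] * V[:, j - 1]
            w_next -= delta[j] * W[:, j - 1]
        if j + 1 == m:
            break
        d = w_next @ v_next
        if abs(d) < 1e-14:
            raise RuntimeError("serious breakdown")
        delta[j + 1] = np.sqrt(abs(d))
        beta[j + 1] = d / delta[j + 1]
        V[:, j + 1] = v_next / delta[j + 1]
        W[:, j + 1] = w_next / beta[j + 1]
    return V, W

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
V, W = two_sided_lanczos(A, rng.normal(size=50), rng.normal(size=50), 8)
T = W.T @ A @ V
print(np.abs(W.T @ V - np.eye(8)).max())          # ~0: biorthogonal bases
print(np.abs(np.tril(T, -2) + np.triu(T, 2)).max())  # ~0: T is tridiagonal
```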
The present work is devoted to strong approximations of a generalized A\"{\i}t-Sahalia model arising from mathematical finance. The numerical study of the considered model faces essential difficulties caused by a drift that blows up at the origin, highly nonlinear drift and diffusion coefficients, and a positivity-preserving requirement. In this paper, a novel explicit Euler-type scheme is proposed, which is easily implementable and preserves positivity of the original model unconditionally, i.e., for any time step-size $h>0$. A mean-square convergence rate of order 0.5 is also obtained for the proposed scheme in both the non-critical and the general critical case. Our work is motivated by the need to justify multilevel Monte Carlo (MLMC) simulations for the underlying model, where the rate of mean-square convergence is required and the preservation of positivity is desirable, particularly for large discretization time steps. To the best of our knowledge, this is the first paper to propose an unconditionally positivity-preserving explicit scheme with order 0.5 of mean-square convergence for the model. Numerical experiments are finally provided to confirm the theoretical findings.
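To see why positivity preservation is the crux, the sketch below runs plain (non-positivity-preserving) Euler-Maruyama on an A\"{\i}t-Sahalia-type model with invented parameters and counts paths that leave the positive half-line. This is emphatically not the paper's scheme, only a demonstration of the failure mode it removes.

```python
# Naive Euler-Maruyama on an Ait-Sahalia-type model: the drift term
# a_{-1}/X blows up near 0, and plain Euler steps can turn negative or
# explode for large h. (NOT the paper's scheme; illustrative parameters.)
import numpy as np

rng = np.random.default_rng(0)
a_m1, a0, a1, a2, sigma, r, rho = 1.0, 1.0, 1.0, 1.0, 0.5, 2.0, 1.5

def euler_failure_rate(x0, T, h, n_paths):
    """Fraction of plain Euler paths that leave (0, inf) before time T."""
    x = np.full(n_paths, x0)
    failed = np.zeros(n_paths, dtype=bool)
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(int(T / h)):
            alive = ~failed
            xa = x[alive]
            drift = a_m1 / xa - a0 + a1 * xa - a2 * xa ** r
            dw = rng.normal(scale=np.sqrt(h), size=xa.size)
            x[alive] = xa + drift * h + sigma * xa ** rho * dw
            failed |= ~(x > 0) | ~np.isfinite(x)   # nonpositive or blown up
    return failed.mean()

for h in (0.01, 0.1, 0.25):
    print(f"h={h}: failure rate {euler_failure_rate(0.5, 5.0, h, 10_000):.2%}")
```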
Stochastic differential equations (SDEs) serve as a powerful modeling tool in various scientific domains, including systems science, engineering, and ecological science. While the specific form of an SDE is typically known for a given problem, certain model parameters remain unknown. Efficiently inferring these unknown parameters from discrete-time observations of the state is an important practical problem. The challenge arises in nonlinear SDEs, where maximum likelihood estimation of parameters is generally infeasible due to the absence of closed-form expressions for the transition and stationary probability density functions of the states. In response to this limitation, we propose a novel two-step parameter inference mechanism. This approach involves a global-search phase followed by a local-refining procedure. The global-search phase is dedicated to identifying the region where the likelihood takes high values, while the local-refining procedure is specifically designed to enhance the surrogate likelihood within this localized region. Additionally, we present two simulation-based approximations for the transition density, aiming to efficiently or accurately approximate the likelihood function. Numerical examples illustrate the efficacy of our proposed methodology for posterior parameter estimation.
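The two-step mechanism can be illustrated on a toy one-dimensional SDE: a surrogate log-likelihood is built from the one-step Euler (Gaussian) transition density, a coarse grid search locates a promising region, and a local optimizer refines the estimate. The SDE and all settings below are invented, and the Euler surrogate stands in for the paper's more sophisticated simulation-based density approximations.

```python
# Hedged illustration of the two-step idea on a 1-D nonlinear SDE:
# dX = -theta * X^3 dt + 0.5 dW. A surrogate log-likelihood from the
# one-step Euler (Gaussian) transition density is searched globally on a
# grid, then refined locally (toy settings; not the paper's approximations).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
dt, theta_true = 0.05, 1.5

# Simulate synthetic discrete observations with the Euler scheme.
x = np.empty(400); x[0] = 1.0
for i in range(399):
    x[i + 1] = x[i] - theta_true * x[i] ** 3 * dt \
        + 0.5 * np.sqrt(dt) * rng.normal()

def neg_loglik(theta):
    mean = x[:-1] - theta * x[:-1] ** 3 * dt  # Euler transition mean
    sd = 0.5 * np.sqrt(dt)
    return -norm.logpdf(x[1:], loc=mean, scale=sd).sum()

# Step 1: global search over a coarse grid.
grid = np.linspace(0.1, 5.0, 50)
theta0 = grid[np.argmin([neg_loglik(t) for t in grid])]

# Step 2: local refinement starting from the best grid point.
res = minimize(neg_loglik, x0=[theta0], method="Nelder-Mead")
print(f"grid start {theta0:.2f} -> refined estimate {res.x[0]:.3f}")
```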