In the usual Bayesian setting, a full probabilistic model is required to link the data and parameters, and the form of this model and the inference and prediction mechanisms are specified via de Finetti's representation. In general, such a formulation is not robust to model mis-specification of its component parts. An alternative approach is to draw inference based on loss functions, where the quantity of interest is defined as a minimizer of some expected loss, and to construct posterior distributions based on the loss-based formulation; this strategy underpins the construction of the Gibbs posterior. We develop a Bayesian non-parametric approach; specifically, we generalize the Bayesian bootstrap, and specify a Dirichlet process model for the distribution of the observables. We implement this using direct prior-to-posterior calculations, but also using predictive sampling. We also study the assessment of posterior validity for non-standard Bayesian calculations, and provide an efficient way to calibrate the scaling parameter in the Gibbs posterior so that it can achieve the desired coverage rate. We show that the developed non-standard Bayesian updating procedures yield valid posterior distributions in terms of consistency and asymptotic normality under model mis-specification. Simulation studies show that the proposed methods can recover the true value of the parameter efficiently and achieve frequentist coverage even when the sample size is small. Finally, we apply our methods to evaluate the causal impact of speed cameras on traffic collisions in England.
Without writing a single line of code by a human, an example Monte Carlo simulation based application for stochastic dependence modeling with copulas is developed using a state-of-the-art large language model (LLM) fine-tuned for conversations. This includes interaction with ChatGPT in natural language and using mathematical formalism, which, under careful supervision by a human-expert, led to producing a working code in MATLAB, Python and R for sampling from a given copula model, evaluation of the model's density, performing maximum likelihood estimation, optimizing the code for parallel computing for CPUs as well as for GPUs, and visualization of the computed results. In contrast to other emerging studies that assess the accuracy of LLMs like ChatGPT on tasks from a selected area, this work rather investigates ways how to achieve a successful solution of a standard statistical task in a collaboration of a human-expert and artificial intelligence (AI). Particularly, through careful prompt engineering, we separate successful solutions generated by ChatGPT from unsuccessful ones, resulting in a comprehensive list of related pros and cons. It is demonstrated that if the typical pitfalls are avoided, we can substantially benefit from collaborating with an AI partner. For example, we show that if ChatGPT is not able to provide a correct solution due to a lack of or incorrect knowledge, the human-expert can feed it with the correct knowledge, e.g., in the form of mathematical theorems and formulas, and make it to apply the gained knowledge in order to provide a solution that is correct. Such ability presents an attractive opportunity to achieve a programmed solution even for users with rather limited knowledge of programming techniques.
This paper describes a purely functional library for computing level-$p$-complexity of Boolean functions, and applies it to two-level iterated majority. Boolean functions are simply functions from $n$ bits to one bit, and they can describe digital circuits, voting systems, etc. An example of a Boolean function is majority, which returns the value that has majority among the $n$ input bits for odd $n$. The complexity of a Boolean function $f$ measures the cost of evaluating it: how many bits of the input are needed to be certain about the result of $f$. There are many competing complexity measures but we focus on level-$p$-complexity -- a function of the probability $p$ that a bit is 1. The level-$p$-complexity $D_p(f)$ is the minimum expected cost when the input bits are independent and identically distributed with Bernoulli($p$) distribution. We specify the problem as choosing the minimum expected cost of all possible decision trees -- which directly translates to a clearly correct, but very inefficient implementation. The library uses thinning and memoization for efficiency and type classes for separation of concerns. The complexity is represented using polynomials, and the order relation used for thinning is implemented using polynomial factorisation and root-counting. Finally we compute the complexity for two-level iterated majority and improve on an earlier result by J.~Jansson.
Data-driven offline model-based optimization (MBO) is an established practical approach to black-box computational design problems for which the true objective function is unknown and expensive to query. However, the standard approach which optimizes designs against a learned proxy model of the ground truth objective can suffer from distributional shift. Specifically, in high-dimensional design spaces where valid designs lie on a narrow manifold, the standard approach is susceptible to producing out-of-distribution, invalid designs that "fool" the learned proxy model into outputting a high value. Using an ensemble rather than a single model as the learned proxy can help mitigate distribution shift, but naive formulations for combining gradient information from the ensemble, such as minimum or mean gradient, are still suboptimal and often hampered by non-convergent behavior. In this work, we explore alternate approaches for combining gradient information from the ensemble that are robust to distribution shift without compromising optimality of the produced designs. More specifically, we explore two functions, formulated as convex optimization problems, for combining gradient information: multiple gradient descent algorithm (MGDA) and conflict-averse gradient descent (CAGrad). We evaluate these algorithms on a diverse set of five computational design tasks. We compare performance of ensemble MBO with MGDA and ensemble MBO with CAGrad with three naive baseline algorithms: (a) standard single-model MBO, (b) ensemble MBO with mean gradient, and (c) ensemble MBO with minimum gradient. Our results suggest that MGDA and CAGrad strike a desirable balance between conservatism and optimality and can help robustify data-driven offline MBO without compromising optimality of designs.
Multiple systems estimation is a standard approach to quantifying hidden populations where data sources are based on lists of known cases. A typical modelling approach is to fit a Poisson loglinear model to the numbers of cases observed in each possible combination of the lists. It is necessary to decide which interaction parameters to include in the model, and information criterion approaches are often used for model selection. Difficulties in the context of multiple systems estimation may arise due to sparse or nil counts based on the intersection of lists, and care must be taken when information criterion approaches are used for model selection due to issues relating to the existence of estimates and identifiability of the model. Confidence intervals are often reported conditional on the model selected, providing an over-optimistic impression of the accuracy of the estimation. A bootstrap approach is a natural way to account for the model selection procedure. However, because the model selection step has to be carried out for every bootstrap replication, there may be a high or even prohibitive computational burden. We explore the merit of modifying the model selection procedure in the bootstrap to look only among a subset of models, chosen on the basis of their information criterion score on the original data. This provides large computational gains with little apparent effect on inference. Another model selection approach considered and investigated is a downhill search approach among models, possibly with multiple starting points.
To simulate bosons on a qubit- or qudit-based quantum computer, one has to regularize the theory by truncating infinite-dimensional local Hilbert spaces to finite dimensions. In the search for practical quantum applications, it is important to know how big the truncation errors can be. In general, it is not easy to estimate errors unless we have a good quantum computer. In this paper we show that traditional sampling methods on classical devices, specifically Markov Chain Monte Carlo, can address this issue with a reasonable amount of computational resources available today. As a demonstration, we apply this idea to the scalar field theory on a two-dimensional lattice, with a size that goes beyond what is achievable using exact diagonalization methods. This method can be used to estimate the resources needed for realistic quantum simulations of bosonic theories, and also, to check the validity of the results of the corresponding quantum simulations.
Completely random measures (CRMs) provide a broad class of priors, arguably, the most popular, for Bayesian nonparametric (BNP) analysis of trait allocations. As a peculiar property, CRM priors lead to predictive distributions that share the following common structure: for fixed prior's parameters, a new data point exhibits a Poisson (random) number of ``new'' traits, i.e., not appearing in the sample, which depends on the sampling information only through the sample size. While the Poisson posterior distribution is appealing for analytical tractability and ease of interpretation, its independence from the sampling information is a critical drawback, as it makes the posterior distribution of ``new'' traits completely determined by the estimation of the unknown prior's parameters. In this paper, we introduce the class of transform-scaled process (T-SP) priors as a tool to enrich the posterior distribution of ``new'' traits arising from CRM priors, while maintaining the same analytical tractability and ease of interpretation. In particular, we present a framework for posterior analysis of trait allocations under T-SP priors, showing that Stable T-SP priors, i.e., T-SP priors built from Stable CRMs, lead to predictive distributions such that, for fixed prior's parameters, a new data point displays a negative-Binomial (random) number of ``new'' traits, which depends on the sampling information through the number of distinct traits and the sample size. Then, by relying on a hierarchical version of T-SP priors, we extend our analysis to the more general setting of trait allocations with multiple groups of data or subpopulations. The empirical effectiveness of our methods is demonstrated through numerical experiments and applications to real data.
Causal discovery and causal reasoning are classically treated as separate and consecutive tasks: one first infers the causal graph, and then uses it to estimate causal effects of interventions. However, such a two-stage approach is uneconomical, especially in terms of actively collected interventional data, since the causal query of interest may not require a fully-specified causal model. From a Bayesian perspective, it is also unnatural, since a causal query (e.g., the causal graph or some causal effect) can be viewed as a latent quantity subject to posterior inference -- other unobserved quantities that are not of direct interest (e.g., the full causal model) ought to be marginalized out in this process and contribute to our epistemic uncertainty. In this work, we propose Active Bayesian Causal Inference (ABCI), a fully-Bayesian active learning framework for integrated causal discovery and reasoning, which jointly infers a posterior over causal models and queries of interest. In our approach to ABCI, we focus on the class of causally-sufficient, nonlinear additive noise models, which we model using Gaussian processes. We sequentially design experiments that are maximally informative about our target causal query, collect the corresponding interventional data, and update our beliefs to choose the next experiment. Through simulations, we demonstrate that our approach is more data-efficient than several baselines that only focus on learning the full causal graph. This allows us to accurately learn downstream causal queries from fewer samples while providing well-calibrated uncertainty estimates for the quantities of interest.
Understanding causality helps to structure interventions to achieve specific goals and enables predictions under interventions. With the growing importance of learning causal relationships, causal discovery tasks have transitioned from using traditional methods to infer potential causal structures from observational data to the field of pattern recognition involved in deep learning. The rapid accumulation of massive data promotes the emergence of causal search methods with brilliant scalability. Existing summaries of causal discovery methods mainly focus on traditional methods based on constraints, scores and FCMs, there is a lack of perfect sorting and elaboration for deep learning-based methods, also lacking some considers and exploration of causal discovery methods from the perspective of variable paradigms. Therefore, we divide the possible causal discovery tasks into three types according to the variable paradigm and give the definitions of the three tasks respectively, define and instantiate the relevant datasets for each task and the final causal model constructed at the same time, then reviews the main existing causal discovery methods for different tasks. Finally, we propose some roadmaps from different perspectives for the current research gaps in the field of causal discovery and point out future research directions.
Since deep neural networks were developed, they have made huge contributions to everyday lives. Machine learning provides more rational advice than humans are capable of in almost every aspect of daily life. However, despite this achievement, the design and training of neural networks are still challenging and unpredictable procedures. To lower the technical thresholds for common users, automated hyper-parameter optimization (HPO) has become a popular topic in both academic and industrial areas. This paper provides a review of the most essential topics on HPO. The first section introduces the key hyper-parameters related to model training and structure, and discusses their importance and methods to define the value range. Then, the research focuses on major optimization algorithms and their applicability, covering their efficiency and accuracy especially for deep learning networks. This study next reviews major services and toolkits for HPO, comparing their support for state-of-the-art searching algorithms, feasibility with major deep learning frameworks, and extensibility for new modules designed by users. The paper concludes with problems that exist when HPO is applied to deep learning, a comparison between optimization algorithms, and prominent approaches for model evaluation with limited computational resources.
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.