We propose a novel two-stage subsampling algorithm based on optimal design principles. In the first stage, we use a density-based clustering algorithm to identify an approximating design space for the predictors from an initial subsample. Next, we determine an optimal approximate design on this design space. Finally, we use matrix distances such as the Procrustes, Frobenius, and square-root distance to define the remaining subsample, such that its points are "closest" to the support points of the optimal design. Our approach reflects the specific nature of the information matrix as a weighted sum of non-negative definite Fisher information matrices evaluated at the design points, and it applies to a large class of regression models, including models where the Fisher information has rank larger than $1$.
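To make the final matching step concrete, here is a minimal sketch in Python, assuming the Fisher information matrices at the candidate data points and at the support points of the optimal design, together with the rounded number of points to allocate to each support point (`counts`), have already been computed; the greedy assignment and all function names are illustrative rather than the paper's exact procedure.

```python
import numpy as np

def frobenius_dist(A, B):
    """Frobenius distance between two non-negative definite matrices."""
    return np.linalg.norm(A - B, ord="fro")

def nearest_to_support(fisher_at_data, fisher_at_support, counts):
    """Greedily assign to each support point of the approximate optimal design
    the counts[k] data points whose Fisher information matrices are closest in
    Frobenius distance; other matrix distances (Procrustes, square-root) could
    be substituted for frobenius_dist."""
    taken = np.zeros(len(fisher_at_data), dtype=bool)
    chosen = []
    for M_k, n_k in zip(fisher_at_support, counts):
        d = np.array([frobenius_dist(M_i, M_k) if not t else np.inf
                      for M_i, t in zip(fisher_at_data, taken)])
        idx = np.argsort(d)[:n_k]       # n_k closest, still-available points
        taken[idx] = True
        chosen.extend(idx.tolist())
    return np.array(chosen)             # indices forming the remaining subsample
```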
In this work, we introduce a novel approach to regularization in multivariable regression problems. Our regularizer, called DLoss, penalises differences between the model's derivatives and the derivatives of the data-generating function as estimated from the training data. We call these estimated derivatives data derivatives. The goal of our method is to align the model to the data not only in terms of target values but also in terms of the derivatives involved. To estimate data derivatives, we select (from the training data) 2-tuples of input-value pairs, using either nearest neighbour or random selection. On synthetic and real datasets, we evaluate the effectiveness of adding DLoss, with different weights, to the standard mean squared error loss. The experimental results show that with DLoss (using nearest neighbour selection) we obtain, on average, the best rank with respect to MSE on validation datasets, compared to no regularization, L2 regularization, and Dropout.
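As a rough illustration, not the paper's exact definition of DLoss, the sketch below estimates data derivatives as difference quotients along nearest-neighbour 2-tuples and penalises their mismatch with the model's directional derivatives; the helper `model_grad`, the use of SciPy's `cKDTree`, and the squared-error form of the penalty are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def data_derivatives(X, y):
    """Estimate directional data derivatives from nearest-neighbour 2-tuples
    (x_i, x_j) as difference quotients (y_j - y_i) / ||x_j - x_i||.
    Assumes no duplicated inputs (otherwise the distance can be zero)."""
    _, nn = cKDTree(X).query(X, k=2)    # column 1 holds each point's nearest neighbour
    j = nn[:, 1]
    diff = X[j] - X
    dist = np.linalg.norm(diff, axis=1)
    slopes = (y[j] - y) / dist          # difference quotients along each 2-tuple
    units = diff / dist[:, None]        # unit directions of the 2-tuples
    return slopes, units

def dloss(model_grad, X, slopes, units):
    """Penalise mismatch between the model's directional derivatives and the
    estimated data derivatives along the same directions."""
    g = model_grad(X)                   # (n, d) gradients of the model at X
    model_slopes = np.sum(g * units, axis=1)
    return np.mean((model_slopes - slopes) ** 2)

# Total objective (lambda_ is the regularization weight):
#   loss = mse(model(X), y) + lambda_ * dloss(model_grad, X, slopes, units)
```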
In this chapter, we address the challenge of exploring the posterior distributions of Bayesian inverse problems with computationally intensive forward models. We consider various multivariate proposal distributions and compare them with single-site Metropolis updates. We show how fast, approximate models can be leveraged to improve MCMC sampling efficiency.
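A standard device for exploiting a cheap approximate model in this spirit (though not necessarily the construction used in the chapter) is the delayed-acceptance Metropolis step sketched below; the symmetric proposal and the function names are assumptions.

```python
import numpy as np

def delayed_acceptance_step(theta, log_post_cheap, log_post_exact, propose, rng):
    """One delayed-acceptance Metropolis step with a symmetric proposal:
    screen the proposal with a cheap approximate log-posterior before
    evaluating the expensive forward model."""
    theta_prop = propose(theta, rng)
    # Stage 1: cheap screen with the approximate posterior.
    log_a1 = log_post_cheap(theta_prop) - log_post_cheap(theta)
    if np.log(rng.uniform()) >= log_a1:
        return theta                    # rejected without running the expensive model
    # Stage 2: correction with the exact (expensive) posterior; the overall
    # chain still targets the exact posterior.
    log_a2 = (log_post_exact(theta_prop) - log_post_exact(theta)) - log_a1
    return theta_prop if np.log(rng.uniform()) < log_a2 else theta
```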
We develop an inferential toolkit for analyzing object-valued responses, i.e., data situated in general metric spaces, paired with Euclidean predictors within the conformal framework. To this end, we introduce conditional profile average transport costs. Distance profiles are one-dimensional distributions of the probability mass falling into balls of increasing radius around a given object, and we compare them via the optimal transport cost of moving one distance profile to another. The average transport cost of moving a given distance profile to all others is crucial for statistical inference in metric spaces and underpins the proposed conditional profile scores. A key feature of the proposed approach is to use the distribution of conditional profile average transport costs as the conformity score for general metric space-valued responses, which facilitates the construction of prediction sets by the split conformal algorithm. We derive the uniform convergence rate of the proposed conformity score estimators and establish asymptotic conditional validity of the prediction sets. The finite-sample performance on synthetic data in various metric spaces demonstrates that the proposed conditional profile score outperforms existing methods in terms of both the coverage level and the size of the resulting prediction sets, even in the special case of scalar and thus Euclidean responses. We also demonstrate the practical utility of conditional profile scores for network data from New York taxi trips and for compositional data reflecting the energy sourcing of U.S. states.
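To fix ideas, a minimal sketch of the unconditional building block, empirical distance profiles and their pairwise one-dimensional transport costs, is given below; the conditioning on Euclidean predictors and the conversion into conformity scores developed in the paper are omitted, and the equal-size, equal-weight empirical formulation is an assumption.

```python
import numpy as np

def profile_transport_costs(D):
    """D[i, j] = metric distance d(Y_i, Y_j) between object-valued responses.
    Sorted row i is an empirical version of the distance profile of Y_i (the
    zero self-distance is kept for simplicity). For one-dimensional
    distributions with equally many atoms, the squared Wasserstein-2 transport
    cost reduces to the mean squared difference of order statistics."""
    P = np.sort(D, axis=1)                              # empirical distance profiles
    n = P.shape[0]
    costs = ((P[:, None, :] - P[None, :, :]) ** 2).mean(axis=2)
    return costs.sum(axis=1) / (n - 1)                  # average cost to all other profiles
```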
Real-world misinformation can be partially correct and even factual but misleading. It undermines public trust in science and democracy, particularly on social media, where it can spread rapidly. High-quality and timely correction of misinformation that identifies and explains its (in)accuracies has been shown to effectively reduce false beliefs. Although manual correction is widely accepted, it is difficult to make it timely and scalable, a growing concern as technologies like large language models (LLMs) make misinformation easier to produce. LLMs also have versatile capabilities that could accelerate misinformation correction; however, they struggle due to a lack of recent information, a tendency to produce false content, and limitations in addressing multimodal information. We propose MUSE, an LLM augmented with access to, and credibility evaluation of, up-to-date information. By retrieving evidence as refutations or contexts, MUSE identifies and explains (in)accuracies in a piece of content, not presupposed to be misinformation, with references. It also describes images and conducts multimodal searches to verify and correct multimodal content. Fact-checking experts evaluate responses to social media content that is not presupposed to be (non-)misinformation but broadly includes incorrect, partially correct, and correct posts that may or may not be misleading. We propose and evaluate 13 dimensions of misinformation correction quality, ranging from the accuracy of identifications and the factuality of explanations to the relevance and credibility of references. The results demonstrate MUSE's ability to promptly write high-quality responses to potential misinformation on social media: overall, MUSE outperforms GPT-4 by 37% and even high-quality responses from laypeople by 29%. This work reveals LLMs' potential to help combat real-world misinformation effectively and efficiently.
We randomize the implicit two-stage Runge-Kutta scheme in order to improve the rate of convergence (with respect to the deterministic scheme) and the stability of the approximate solution (with respect to the solution generated by the explicit scheme). For the stability analysis, we use Dahlquist's concept of A-stability, adapted to randomized schemes by considering three notions of stability: asymptotic, mean-square, and in probability. The randomized implicit RK2 scheme proves to be A-stable asymptotically and in probability but not in the mean-square sense.
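For orientation, the three notions can be phrased on Dahlquist's test equation in the usual way; this is a generic formulation for a randomized one-step scheme with random stability function $R$, not necessarily the exact definitions used here:
\[
y' = \lambda y,\quad \operatorname{Re}\lambda < 0, \qquad y_{n+1} = R(h\lambda,\xi_n)\, y_n,\quad \xi_n \ \text{i.i.d.},
\]
\[
\text{asymptotic: } y_n \xrightarrow{\ \text{a.s.}\ } 0, \qquad
\text{mean-square: } \mathbb{E}|y_n|^2 \to 0 \ \Longleftrightarrow\ \mathbb{E}\,|R(h\lambda,\xi_0)|^2 < 1, \qquad
\text{in probability: } y_n \xrightarrow{\ \mathbb{P}\ } 0,
\]
where the mean-square equivalence follows from independence, since $\mathbb{E}|y_n|^2 = \big(\mathbb{E}|R(h\lambda,\xi_0)|^2\big)^n |y_0|^2$.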
Modern regression applications can involve hundreds or thousands of variables, which motivates the use of variable selection methods. Bayesian variable selection defines a posterior distribution on the possible subsets of the variables (which are usually termed models) to express uncertainty about which variables are strongly linked to the response. This can be used to provide Bayesian model averaged predictions or inference, and to understand the relative importance of different variables. However, there has been little work on meaningful representations of this uncertainty beyond first order summaries. We introduce Cartesian credible sets to address this gap. The elements of these sets are formed by concatenating sub-models defined on each block of a partition of the variables. Investigating these sub-models allows us to understand whether the models in the Cartesian credible set always/never/sometimes include a particular variable or group of variables, and provides a useful summary of model uncertainty. We introduce methods to find these sets that emphasize ease of understanding. The potential of the method is illustrated on regression problems with both small and large numbers of variables.
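To illustrate the construction, the sketch below builds a Cartesian product of per-block sub-model patterns from posterior samples of inclusion indicators; the per-block thresholding is only a schematic stand-in for the paper's algorithm and does not reproduce its coverage guarantees, and `gamma`, `blocks`, and `level` are assumed inputs.

```python
import numpy as np
from itertools import product

def cartesian_credible_set(gamma, blocks, level=0.9):
    """gamma: (S, p) array of posterior inclusion indicators, one row per draw.
    blocks: list of index arrays partitioning {0, ..., p-1}.
    For each block, keep the most probable sub-model patterns until their
    cumulative marginal posterior probability reaches `level`; the Cartesian
    credible set is formed by concatenating one kept pattern per block."""
    kept = []
    for b in blocks:
        patterns, counts = np.unique(gamma[:, b], axis=0, return_counts=True)
        order = np.argsort(-counts)
        cum = np.cumsum(counts[order]) / gamma.shape[0]
        k = int(np.searchsorted(cum, level)) + 1
        kept.append(patterns[order][:k])
    models = []
    for combo in product(*kept):
        m = np.zeros(gamma.shape[1], dtype=int)
        for b, pat in zip(blocks, combo):
            m[b] = pat                  # place the block's pattern in the full model
        models.append(m)
    return models
```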
Bayesian sampling is an important task in statistics and machine learning. Over the past decade, many ensemble-type sampling methods have been proposed. In contrast to classical Markov chain Monte Carlo methods, these new methods deploy a large number of interacting samples, and the communication between these samples is crucial in speeding up the convergence. To justify the validity of these sampling strategies, the concept of interacting particles naturally calls for mean-field theory. The theory establishes a correspondence between particle interactions encoded in a set of coupled ODEs/SDEs and a PDE that characterizes the evolution of the underlying distribution. This bridges numerical algorithms with the PDE theory used to show convergence in time. A variety of mathematical machinery has been developed to provide the mean-field analysis, and we showcase two such examples: the coupling method and the compactness argument built upon the martingale strategy. The former has been deployed to show the convergence of the ensemble Kalman sampler and ensemble Kalman inversion, and the latter will be shown to be immensely powerful in proving the validity of the Vlasov-Boltzmann simulator.
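Schematically, and as a generic template rather than the specific ensemble Kalman dynamics treated here, the correspondence is between an interacting particle system driven by its own empirical measure,
\[
dX_t^i = b\big(X_t^i, \mu_t^N\big)\,dt + \sigma\big(X_t^i, \mu_t^N\big)\,dW_t^i,
\qquad \mu_t^N = \frac{1}{N}\sum_{j=1}^N \delta_{X_t^j},
\]
and, as $N \to \infty$ under suitable conditions, the nonlinear Fokker-Planck (mean-field) equation satisfied by the limiting law $\mu_t$,
\[
\partial_t \mu_t = -\nabla\cdot\big(b(x,\mu_t)\,\mu_t\big) + \tfrac{1}{2}\,\nabla^2 : \big(\sigma\sigma^{\top}(x,\mu_t)\,\mu_t\big),
\]
where $\nabla^2 :$ denotes the double divergence $\sum_{i,j}\partial_i\partial_j$.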
Model averaging (MA), a technique for combining estimators from a set of candidate models, has attracted increasing attention in machine learning and statistics. In the existing literature, there is an implicit understanding that MA can be viewed as a form of shrinkage estimation that draws the response vector towards the subspaces spanned by the candidate models. This paper explores this perspective by establishing connections between MA and shrinkage in a linear regression setting with multiple nested models. We first demonstrate that the optimal MA estimator is the best linear estimator with monotonically non-increasing weights in a Gaussian sequence model. The Mallows MA (MMA), which estimates weights by minimizing Mallows' $C_p$ over the unit simplex, can be viewed as a variation of the sum of a set of positive-part Stein estimators. Indeed, the latter estimator differs from MMA only in that its optimization of Mallows' $C_p$ is carried out over a suitably relaxed weight set. Motivated by these connections, we develop a novel MA procedure based on blockwise Stein estimation. The resulting Stein-type MA estimator is asymptotically optimal across a broad parameter space when the variance is known. Numerical results support our theoretical findings. The connections established in this paper may open up new avenues for investigating MA from different perspectives. A discussion of topics for future research concludes the paper.
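A minimal sketch of the MMA weight computation referred to above, assuming fitted values from the nested candidate models and a known (or pre-estimated) error variance `sigma2`; the use of SciPy's SLSQP solver is an implementation convenience, not the paper's method.

```python
import numpy as np
from scipy.optimize import minimize

def mma_weights(y, fits, dims, sigma2):
    """Mallows model averaging: minimise ||y - F w||^2 + 2 * sigma2 * dims @ w
    over the unit simplex, where column m of F holds the fitted values of
    candidate model m and dims[m] is its number of parameters."""
    F = np.column_stack(fits)
    dims = np.asarray(dims, dtype=float)
    M = F.shape[1]
    cp = lambda w: np.sum((y - F @ w) ** 2) + 2.0 * sigma2 * dims @ w
    cons = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    res = minimize(cp, np.full(M, 1.0 / M), bounds=[(0.0, 1.0)] * M,
                   constraints=cons, method="SLSQP")
    return res.x                        # estimated simplex weights
```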
Deep learning models can exhibit what appears to be a sudden ability to solve a new problem as training time ($T$), training data ($D$), or model size ($N$) increases, a phenomenon known as emergence. In this paper, we present a framework where each new ability (a skill) is represented as a basis function. We solve a simple multi-linear model in this skill basis, finding analytic expressions for the emergence of new skills, as well as for scaling laws of the loss with training time, data size, model size, and optimal compute ($C$). We compare our detailed calculations to direct simulations of a two-layer neural network trained on multitask sparse parity, where the tasks in the dataset are distributed according to a power law. Our simple model captures, using a single fit parameter, the sigmoidal emergence of multiple new skills as training time, data size, or model size increases in the neural network.
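For reference, a small generator of the multitask sparse parity setup with power-law task frequencies, in the spirit of the experiments described above; the parameter names and the exact encoding (one-hot task block followed by the bit string) are assumptions that may differ from the paper's setup.

```python
import numpy as np

def multitask_sparse_parity(n_samples, n_tasks, n_bits, k, alpha, seed=0):
    """Each example consists of a one-hot task block and a random bit string;
    its label is the parity of a task-specific subset of k bits. Tasks are
    drawn with power-law frequencies proportional to (i + 1)^(-alpha)."""
    rng = np.random.default_rng(seed)
    probs = np.arange(1, n_tasks + 1, dtype=float) ** (-alpha)
    probs /= probs.sum()
    subsets = [rng.choice(n_bits, size=k, replace=False) for _ in range(n_tasks)]
    tasks = rng.choice(n_tasks, size=n_samples, p=probs)
    bits = rng.integers(0, 2, size=(n_samples, n_bits))
    X = np.zeros((n_samples, n_tasks + n_bits), dtype=np.int8)
    X[np.arange(n_samples), tasks] = 1          # one-hot task block
    X[:, n_tasks:] = bits
    y = np.array([bits[i, subsets[t]].sum() % 2 for i, t in enumerate(tasks)])
    return X, y
```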
Logistic regression is widely used in many areas of knowledge. Several works compare the performance of lasso and maximum likelihood estimation in logistic regression. However, some of these works do not perform simulation studies, and the remaining ones do not consider scenarios in which the ratio of the number of covariates to the sample size is high. In this work, we compare the discrimination performance of lasso and maximum likelihood estimation in logistic regression using simulation studies and applications. Variable selection is carried out by lasso itself and by stepwise selection when maximum likelihood estimation is used. We consider a wide range of values for the ratio of the number of covariates to the sample size. The main conclusion of this work is that lasso has better discrimination performance than maximum likelihood estimation when the ratio of the number of covariates to the sample size is high.
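A compact sketch of the kind of comparison described, assuming scikit-learn (version 1.2 or later for `penalty=None`); a cross-validated choice of the lasso penalty and a single train/test split stand in for the paper's simulation design, and the stepwise selection used alongside maximum likelihood is omitted.

```python
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def compare_discrimination(X, y, seed=0):
    """Compare out-of-sample discrimination (AUC) of L1-penalised logistic
    regression against unpenalised maximum likelihood estimation."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed, stratify=y)
    lasso = LogisticRegressionCV(penalty="l1", solver="liblinear",
                                 Cs=10, cv=5).fit(X_tr, y_tr)
    mle = LogisticRegression(penalty=None, max_iter=5000).fit(X_tr, y_tr)
    return {"lasso_auc": roc_auc_score(y_te, lasso.predict_proba(X_te)[:, 1]),
            "mle_auc": roc_auc_score(y_te, mle.predict_proba(X_te)[:, 1])}
```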