We propose a novel two-stage subsampling algorithm based on optimal design principles. In the first stage, we use a density-based clustering algorithm to identify an approximating design space for the predictors from an initial subsample. Next, we determine an optimal approximate design on this design space. Finally, we use matrix distances such as the Procrustes, Frobenius, and square-root distance to define the remaining subsample, such that its points are "closest" to the support points of the optimal design. Our approach reflects the specific nature of the information matrix as a weighted sum of non-negative definite Fisher information matrices evaluated at the design points and applies to a large class of regression models including models where the Fisher information is of rank larger than $1$.
We provide the first convergence guarantees for the Consistency Models (CMs), a newly emerging type of one-step generative models that can generate comparable samples to those generated by Diffusion Models. Our main result is that, under the basic assumptions on score-matching errors, consistency errors and smoothness of the data distribution, CMs can efficiently sample from any realistic data distribution in one step with small $W_2$ error. Our results (1) hold for $L^2$-accurate score and consistency assumption (rather than $L^\infty$-accurate); (2) do note require strong assumptions on the data distribution such as log-Sobelev inequality; (3) scale polynomially in all parameters; and (4) match the state-of-the-art convergence guarantee for score-based generative models (SGMs). We also provide the result that the Multistep Consistency Sampling procedure can further reduce the error comparing to one step sampling, which support the original statement of "Consistency Models, Yang Song 2023". Our result further imply a TV error guarantee when take some Langevin-based modifications to the output distributions.
This paper presents a novel, interdisciplinary study that leverages a Machine Learning (ML) assisted framework to explore the geometry of affine Deligne-Lusztig varieties (ADLV). The primary objective is to investigate the nonemptiness pattern, dimension and enumeration of irreducible components of ADLV. Our proposed framework demonstrates a recursive pipeline of data generation, model training, pattern analysis, and human examination, presenting an intricate interplay between ML and pure mathematical research. Notably, our data-generation process is nuanced, emphasizing the selection of meaningful subsets and appropriate feature sets. We demonstrate that this framework has a potential to accelerate pure mathematical research, leading to the discovery of new conjectures and promising research directions that could otherwise take significant time to uncover. We rediscover the virtual dimension formula and provide a full mathematical proof of a newly identified problem concerning a certain lower bound of dimension. Furthermore, we extend an open invitation to the readers by providing the source code for computing ADLV and the ML models, promoting further explorations. This paper concludes by sharing valuable experiences and highlighting lessons learned from this collaboration.
In this paper, we address the problem of modeling data with periodic autoregressive (PAR) time series and additive noise. In most cases, the data are processed assuming a noise-free model (i.e., without additive noise), which is not a realistic assumption in real life. The first two steps in PAR model identification are order selection and period estimation, so the main focus is on these issues. Finally, the model should be validated, so a procedure for analyzing the residuals, which are considered here as multidimensional vectors, is proposed. Both order and period selection, as well as model validation, are addressed by using the characteristic function (CF) of the residual series. The CF is used to obtain the probability density function, which is utilized in the information criterion and for residuals distribution testing. To complete the PAR model analysis, the procedure for estimating the coefficients is necessary. However, this issue is only mentioned here as it is a separate task (under consideration in parallel). The presented methodology can be considered as the general framework for analyzing data with periodically non-stationary characteristics disturbed by finite-variance external noise. The original contribution is in the selection of the optimal model order and period identification, as well as the analysis of residuals. All these findings have been inspired by our previous work on machine condition monitoring that used PAR modeling
Ecological spatial areal models encounter the well-known and challenging problem of spatial confounding. This issue makes it arduous to distinguish between the impacts of observed covariates and spatial random effects. Despite previous research and various proposed methods to tackle this problem, finding a definitive solution remains elusive. In this paper, we propose a one-step version of the spatial+ approach that involves dividing the covariate into two components. One component captures large-scale spatial dependence, while the other accounts for short-scale dependence. This approach eliminates the need to separately fit spatial models for the covariates. We apply this method to analyze two forms of crimes against women, namely rapes and dowry deaths, in Uttar Pradesh, India, exploring their relationship with socio-demographic covariates. To evaluate the performance of the new approach, we conduct extensive simulation studies under different spatial confounding scenarios. The results demonstrate that the proposed method provides reliable estimates of fixed effects and posterior correlations between different responses.
We combine Kronecker products, and quantitative information flow, to give a novel formal analysis for the fine-grained verification of utility in complex privacy pipelines. The combination explains a surprising anomaly in the behaviour of utility of privacy-preserving pipelines -- that sometimes a reduction in privacy results also in a decrease in utility. We use the standard measure of utility for Bayesian analysis, introduced by Ghosh at al., to produce tractable and rigorous proofs of the fine-grained statistical behaviour leading to the anomaly. More generally, we offer the prospect of formal-analysis tools for utility that complement extant formal analyses of privacy. We demonstrate our results on a number of common privacy-preserving designs.
We propose a new way to assess certain short constructed responses to mathematics items. Our approach uses a pipeline that identifies the key values specified by the student in their response. This allows us to determine the correctness of the response, as well as identify any misconceptions. The information from the value identification pipeline can then be used to provide feedback to the teacher and student. The value identification pipeline consists of two fine-tuned language models. The first model determines if a value is implicit in the student response. The second model identifies where in the response the key value is specified. We consider both a generic model that can be used for any prompt and value, as well as models that are specific to each prompt and value. The value identification pipeline is a more accurate and informative way to assess short constructed responses than traditional rubric-based scoring. It can be used to provide more targeted feedback to students, which can help them improve their understanding of mathematics.
We report some results regarding the mechanization of normative (preference-based) conditional reasoning. Our focus is on Aqvist's system E for conditional obligation (and its extensions). Our mechanization is achieved via a shallow semantical embedding in Isabelle/HOL. We consider two possible uses of the framework. The first one is as a tool for meta-reasoning about the considered logic. We employ it for the automated verification of deontic correspondences (broadly conceived) and related matters, analogous to what has been previously achieved for the modal logic cube. The second use is as a tool for assessing ethical arguments. We provide a computer encoding of a well-known paradox in population ethics, Parfit's repugnant conclusion. Whether the presented encoding increases or decreases the attractiveness and persuasiveness of the repugnant conclusion is a question we would like to pass on to philosophy and ethics.
A pervasive challenge in neuroscience is testing whether neuronal connectivity changes over time due to specific causes, such as stimuli, events, or clinical interventions. Recent hardware innovations and falling data storage costs enable longer, more naturalistic neuronal recordings. The implicit opportunity for understanding the self-organised brain calls for new analysis methods that link temporal scales: from the order of milliseconds over which neuronal dynamics evolve, to the order of minutes, days or even years over which experimental observations unfold. This review article demonstrates how hierarchical generative models and Bayesian inference help to characterise neuronal activity across different time scales. Crucially, these methods go beyond describing statistical associations among observations and enable inference about underlying mechanisms. We offer an overview of fundamental concepts in state-space modeling and suggest a taxonomy for these methods. Additionally, we introduce key mathematical principles that underscore a separation of temporal scales, such as the slaving principle, and review Bayesian methods that are being used to test hypotheses about the brain with multi-scale data. We hope that this review will serve as a useful primer for experimental and computational neuroscientists on the state of the art and current directions of travel in the complex systems modelling literature.
In this paper we develop a numerical method for efficiently approximating solutions of certain Zakai equations in high dimensions. The key idea is to transform a given Zakai SPDE into a PDE with random coefficients. We show that under suitable regularity assumptions on the coefficients of the Zakai equation, the corresponding random PDE admits a solution random field which, for almost all realizations of the random coefficients, can be written as a classical solution of a linear parabolic PDE. This makes it possible to apply the Feynman--Kac formula to obtain an efficient Monte Carlo scheme for computing approximate solutions of Zakai equations. The approach achieves good results in up to 25 dimensions with fast run times.
Many inductive logic programming approaches struggle to learn programs from noisy data. To overcome this limitation, we introduce an approach that learns minimal description length programs from noisy data, including recursive programs. Our experiments on several domains, including drug design, game playing, and program synthesis, show that our approach can outperform existing approaches in terms of predictive accuracies and scale to moderate amounts of noise.