Industrial projects rely heavily on lengthy, complex specification documents, making tedious manual extraction of structured information a major bottleneck. This paper introduces an innovative approach to automate this process, leveraging the capabilities of two cutting-edge AI models: Donut, a model that extracts information directly from scanned documents without OCR, and OpenAI GPT-3.5 Turbo, a robust large language model. The proposed methodology is initiated by acquiring the table of contents (ToCs) from construction specification documents and subsequently structuring the ToCs text into JSON data. Remarkable accuracy is achieved, with Donut reaching 85% and GPT-3.5 Turbo reaching 89% in effectively organizing the ToCs. This landmark achievement represents a significant leap forward in document indexing, demonstrating the immense potential of AI to automate information extraction tasks across diverse document types, boosting efficiency and liberating critical resources in various industries.
In contemporary problems involving genetic or neuroimaging data, thousands of hypotheses need to be tested. Due to their high power, and finite sample guarantees on type-1 error under weak assumptions, Monte-Carlo permutation tests are often considered as gold standard for these settings. However, the enormous computational effort required for (thousands of) permutation tests is a major burden. Recently, Fischer and Ramdas (2024) constructed a permutation test for a single hypothesis in which the permutations are drawn sequentially one-by-one and the testing process can be stopped at any point without inflating the type I error. They showed that the number of permutations can be substantially reduced (under null and alternative) while the power remains similar. We show how their approach can be modified to make it suitable for a broad class of multiple testing procedures. In particular, we discuss its use with the Benjamini-Hochberg procedure and illustrate the application on a large dataset.
Deep learning methods are increasingly becoming instrumental as modeling tools in computational neuroscience, employing optimality principles to build bridges between neural responses and perception or behavior. Developing models that adequately represent uncertainty is however challenging for deep learning methods, which often suffer from calibration problems. This constitutes a difficulty in particular when modeling cortical circuits in terms of Bayesian inference, beyond single point estimates such as the posterior mean or the maximum a posteriori. In this work we systematically studied uncertainty representations in latent representations of variational auto-encoders (VAEs), both in a perceptual task from natural images and in two other canonical tasks of computer vision, finding a poor alignment between uncertainty and informativeness or ambiguities in the images. We next showed how a novel approach which we call explaining-away variational auto-encoders (EA-VAEs), fixes these issues, producing meaningful reports of uncertainty in a variety of scenarios, including interpolation, image corruption, and even out-of-distribution detection. We show EA-VAEs may prove useful both as models of perception in computational neuroscience and as inference tools in computer vision.
We provide full theoretical guarantees for the convergence behaviour of diffusion-based generative models under the assumption of strongly log-concave data distributions while our approximating class of functions used for score estimation is made of Lipschitz continuous functions. We demonstrate via a motivating example, sampling from a Gaussian distribution with unknown mean, the powerfulness of our approach. In this case, explicit estimates are provided for the associated optimization problem, i.e. score approximation, while these are combined with the corresponding sampling estimates. As a result, we obtain the best known upper bound estimates in terms of key quantities of interest, such as the dimension and rates of convergence, for the Wasserstein-2 distance between the data distribution (Gaussian with unknown mean) and our sampling algorithm. Beyond the motivating example and in order to allow for the use of a diverse range of stochastic optimizers, we present our results using an $L^2$-accurate score estimation assumption, which crucially is formed under an expectation with respect to the stochastic optimizer and our novel auxiliary process that uses only known information. This approach yields the best known convergence rate for our sampling algorithm.
Test-negative designs are widely used for post-market evaluation of vaccine effectiveness, particularly in cases where randomization is not feasible. Differing from classical test-negative designs where only healthcare-seekers with symptoms are included, recent test-negative designs have involved individuals with various reasons for testing, especially in an outbreak setting. While including these data can increase sample size and hence improve precision, concerns have been raised about whether they introduce bias into the current framework of test-negative designs, thereby demanding a formal statistical examination of this modified design. In this article, using statistical derivations, causal graphs, and numerical simulations, we show that the standard odds ratio estimator may be biased if various reasons for testing are not accounted for. To eliminate this bias, we identify three categories of reasons for testing, including symptoms, disease-unrelated reasons, and case contact tracing, and characterize associated statistical properties and estimands. Based on our characterization, we show how to consistently estimate each estimand via stratification. Furthermore, we describe when these estimands correspond to the same vaccine effectiveness parameter, and, when appropriate, propose a stratified estimator that can incorporate multiple reasons for testing and improve precision. The performance of our proposed method is demonstrated through simulation studies.
Cross-validation is usually employed to evaluate the performance of a given statistical methodology. When such a methodology depends on a number of tuning parameters, cross-validation proves to be helpful to select the parameters that optimize the estimated performance. In this paper, however, a very different and nonstandard use of cross-validation is investigated. Instead of focusing on the cross-validated parameters, the main interest is switched to the estimated value of the error criterion at optimal performance. It is shown that this approach is able to provide consistent and efficient estimates of some density functionals, with the noteworthy feature that these estimates do not rely on the choice of any further tuning parameter, so that, in that sense, they can be considered to be purely empirical. Here, a base case of application of this new paradigm is developed in full detail, while many other possible extensions are hinted as well.
In this paper a set of previous general results for the development of B--series for a broad class of stochastic differential equations has been collected. The applicability of these results is demonstrated by the derivation of B--series for non-autonomous semi-linear SDEs and exponential Runge-Kutta methods applied to this class of SDEs, which is a significant generalization of existing theory on such methods.
We consider problems where many, somewhat redundant, hypotheses are tested and we are interested in reporting the most precise rejections, with false discovery rate (FDR) control. This is the case, for example, when researchers are interested both in individual hypotheses as well as group hypotheses corresponding to intersections of sets of the original hypotheses, at several resolution levels. A concrete application is in genome-wide association studies, where, depending on the signal strengths, it might be possible to resolve the influence of individual genetic variants on a phenotype with greater or lower precision. To adapt to the unknown signal strength, analyses are conducted at multiple resolutions and researchers are most interested in the more precise discoveries. Assuring FDR control on the reported findings with these adaptive searches is, however, often impossible. To design a multiple comparison procedure that allows for an adaptive choice of resolution with FDR control, we leverage e-values and linear programming. We adapt this approach to problems where knockoffs and group knockoffs have been successfully applied to test conditional independence hypotheses. We demonstrate its efficacy by analyzing data from the UK Biobank.
In (Dzanic, J. Comp. Phys., 508:113010, 2024), a limiting approach for high-order discontinuous Galerkin schemes was introduced which allowed for imposing constraints on the solution continuously (i.e., everywhere within the element). While exact for linear constraint functionals, this approach only imposed a sufficient (but not the minimum necessary) amount of limiting for nonlinear constraint functionals. This short note shows how this limiting approach can be extended to allow exactness for general nonlinear quasiconcave constraint functionals through a nonlinear limiting procedure, reducing unnecessary numerical dissipation. Some examples are shown for nonlinear pressure and entropy constraints in the compressible gas dynamics equations, where both analytic and iterative approaches are used.
This work presents GAL{\AE}XI as a novel, energy-efficient flow solver for the simulation of compressible flows on unstructured meshes leveraging the parallel computing power of modern Graphics Processing Units (GPUs). GAL{\AE}XI implements the high-order Discontinuous Galerkin Spectral Element Method (DGSEM) using shock capturing with a finite-volume subcell approach to ensure the stability of the high-order scheme near shocks. This work provides details on the general code design, the parallelization strategy, and the implementation approach for the compute kernels with a focus on the element local mappings between volume and surface data due to the unstructured mesh. GAL{\AE}XI exhibits excellent strong scaling properties up to 1024 GPUs if each GPU is assigned a minimum of one million degrees of freedom degrees of freedom. To verify its implementation, a convergence study is performed that recovers the theoretical order of convergence of the implemented numerical schemes. Moreover, the solver is validated using both the incompressible and compressible formulation of the Taylor-Green-Vortex at a Mach number of 0.1 and 1.25, respectively. A mesh convergence study shows that the results converge to the high-fidelity reference solution and that the results match the original CPU implementation. Finally, GAL{\AE}XI is applied to a large-scale wall-resolved large eddy simulation of a linear cascade of the NASA Rotor 37. Here, the supersonic region and shocks at the leading edge are captured accurately and robustly by the implemented shock-capturing approach. It is demonstrated that GAL{\AE}XI requires less than half of the energy to carry out this simulation in comparison to the reference CPU implementation. This renders GAL{\AE}XI as a potent tool for accurate and efficient simulations of compressible flows in the realm of exascale computing and the associated new HPC architectures.
In this paper, we propose a new algorithm, the irrational-window-filter projection method (IWFPM), for solving arbitrary dimensional global quasiperiodic systems. Based on the projection method (PM), IWFPM further utilizes the concentrated distribution of Fourier coefficients to filter out relevant spectral points using an irrational window. Moreover, a corresponding index-shift transform is designed to make the Fast Fourier Transform available. The corresponding error analysis on the function approximation level is also given. We apply IWFPM to 1D, 2D, and 3D quasiperiodic Schr\"odinger eigenproblems to demonstrate its accuracy and efficiency. IWFPM exhibits a significant computational advantage over PM for both extended and localized quantum states. Furthermore, the widespread existence of such spectral point distribution feature can endow IWFPM with significant potential for broader applications in quasiperiodic systems.