With the proliferation of ever more complicated Deep Learning architectures, data synthesis is a highly promising technique to address the demand of data-hungry models. However, reliably assessing the quality of a 'synthesiser' model's output is an open research question with significant associated risks for high-stakes domains. To address this challenge, we have designed a unique confident data synthesis algorithm that introduces statistical confidence guarantees through a novel extension of the Conformal Prediction framework. We support our proposed algorithm with theoretical proofs and an extensive empirical evaluation on five benchmark datasets. To show our approach's versatility on ubiquitous real-world challenges, the datasets were carefully selected for their variety of difficult characteristics: low sample count, class imbalance and non-separability, and privacy-sensitive data. In all trials, training sets extended with our confident synthesised data performed at least as well as the original, and frequently significantly improved Deep Learning performance by up to +65% F1-score.
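As background for the confidence guarantee described above, here is a minimal sketch of a standard inductive conformal p-value used to filter candidate synthetic samples. It is not the authors' extension of Conformal Prediction. The function names, the nonconformity scores, and the significance level `epsilon` are illustrative assumptions.

```python
from typing import List, Sequence


def conformal_p_value(cal_scores: Sequence[float], score: float) -> float:
    """Standard inductive conformal p-value of one candidate.

    cal_scores: nonconformity scores of held-out calibration examples
                (higher means "stranger"); how they are computed is up
                to the user and is an assumption of this sketch.
    score:      nonconformity score of the candidate sample.
    """
    n_at_least_as_strange = sum(1 for s in cal_scores if s >= score)
    return (n_at_least_as_strange + 1) / (len(cal_scores) + 1)


def filter_candidates(
    cal_scores: Sequence[float],
    cand_scores: Sequence[float],
    epsilon: float,
) -> List[int]:
    """Indices of candidates whose p-value exceeds epsilon."""
    return [
        i
        for i, s in enumerate(cand_scores)
        if conformal_p_value(cal_scores, s) > epsilon
    ]


if __name__ == "__main__":
    calibration = [0.1, 0.2, 0.3, 0.4]
    candidates = [0.25, 0.5, 0.05]
    print(filter_candidates(calibration, candidates, epsilon=0.2))
```

Candidates that look at least as typical as a sufficient share of the calibration data are kept, and the rest are discarded. Under exchangeability, a genuine sample is rejected with probability at most `epsilon`.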
Spatially distributed functional data are prevalent in many statistical applications such as meteorology, energy forecasting, census data, disease mapping, and neurological studies. Given their complex and high-dimensional nature, functional data often require dimension reduction methods to extract meaningful information. Inverse regression is one such approach that has become very popular in the past two decades. We study the inverse regression in the framework of functional data observed at irregularly positioned spatial sites. The functional predictor is the sum of a spatially dependent functional effect and a spatially independent functional nugget effect, while the relation between the scalar response and the functional predictor is modeled using the inverse regression framework. For estimation, we consider local linear smoothing with a general weighting scheme, which includes as special cases the schemes under which equal weights are assigned to each observation or to each subject. This framework enables us to present the asymptotic results for different types of sampling plans over time such as non-dense, dense, and ultra-dense. We discuss the domain-expanding infill (DEI) framework for spatial asymptotics, which is a mix of the traditional expanding domain and infill frameworks. The DEI framework overcomes the limitations of traditional spatial asymptotics in the existing literature. Under this unified framework, we develop asymptotic theory and identify conditions that are necessary for the estimated eigen-directions to achieve optimal rates of convergence. Our asymptotic results include pointwise and $L_2$ convergence rates. Simulation studies using synthetic data and an application to a real-world dataset confirm the effectiveness of our methods.
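The two special weighting schemes mentioned above can be illustrated with a minimal sketch of a weighted local linear smoother for a mean function. This is a simplified, non-spatial illustration under stated assumptions, not the estimator studied in the paper. The Epanechnikov kernel, the function names, and the scheme labels `"obs"` and `"subj"` are hypothetical choices.

```python
from typing import List, Sequence


def epanechnikov(u: float) -> float:
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def subject_weights(counts: Sequence[int], scheme: str) -> List[float]:
    """Weight applied to every observation of each subject.

    "obs":  equal weight per observation, 1 / (total number of observations).
    "subj": equal weight per subject, 1 / (n * N_i) for a subject with N_i points.
    """
    n = len(counts)
    if scheme == "obs":
        total = sum(counts)
        return [1.0 / total for _ in counts]
    if scheme == "subj":
        return [1.0 / (n * c) for c in counts]
    raise ValueError(f"unknown scheme: {scheme}")


def local_linear_mean(
    t0: float,
    times: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    weights: Sequence[float],
    h: float,
) -> float:
    """Weighted local linear estimate of the mean function at t0."""
    s0 = s1 = s2 = r0 = r1 = 0.0
    for t_i, y_i, w_i in zip(times, values, weights):
        for t, y in zip(t_i, y_i):
            k = w_i * epanechnikov((t - t0) / h) / h
            d = t - t0
            s0 += k
            s1 += k * d
            s2 += k * d * d
            r0 += k * y
            r1 += k * d * y
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("too few points in the window; increase h")
    # Intercept of the weighted least-squares line fitted around t0.
    return (s2 * r0 - s1 * r1) / det
```

With densely observed subjects, the `"obs"` scheme lets those subjects dominate the fit, while `"subj"` gives each subject the same total influence. A general scheme interpolates between the two.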
The Koopman operator serves as the theoretical backbone for machine learning of dynamical control systems, where the operator is heuristically approximated by extended dynamic mode decomposition (EDMD). In this paper, we propose Stability- and certificate-oriented EDMD (SafEDMD): a novel EDMD-based learning architecture that comes with rigorous certificates, resulting in a reliable surrogate model generated in a data-driven fashion. To ensure trustworthiness of SafEDMD, we derive proportional error bounds, which vanish at the origin and are tailored for control tasks, leading to certified controller design based on semi-definite programming. We illustrate the developed machinery by means of several benchmark examples and highlight the advantages over state-of-the-art methods.
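For readers unfamiliar with EDMD, the following is a minimal sketch of plain EDMD for an autonomous scalar system, solved through the normal equations. It is not SafEDMD and provides no certificates or control inputs. The helper names, the dictionary, and the example system `x+ = 0.5 x` are illustrative assumptions.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def solve_linear(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular Gram matrix; use more or richer data")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def edmd(
    xs: Sequence[float],
    ys: Sequence[float],
    dictionary: Callable[[float], List[float]],
) -> Matrix:
    """Least-squares matrix K such that dictionary(y) ~ K @ dictionary(x).

    xs, ys: snapshot pairs, where ys[k] is the successor state of xs[k].
    """
    px = [dictionary(x) for x in xs]
    py = [dictionary(y) for y in ys]
    m = len(px[0])
    # Gram matrix G = sum_k psi(x_k) psi(x_k)^T
    g = [[sum(p[i] * p[j] for p in px) for j in range(m)] for i in range(m)]
    # Cross matrix A = sum_k psi(x_k) psi(y_k)^T
    a = [
        [sum(p[i] * q[j] for p, q in zip(px, py)) for j in range(m)]
        for i in range(m)
    ]
    # Normal equations of the least-squares problem: G K^T = A
    kt = solve_linear(g, a)
    return [[kt[j][i] for j in range(m)] for i in range(m)]


if __name__ == "__main__":
    states = [1.0, 2.0, -1.0, 0.5]
    successors = [0.5 * x for x in states]  # assumed toy dynamics
    print(edmd(states, successors, lambda x: [x, x * x]))
```

For this toy system the monomials `x` and `x^2` span an invariant subspace, so the recovered matrix is close to `diag(0.5, 0.25)`. In general the dictionary is not invariant and the approximation carries an error, which is the gap that certified error bounds aim to quantify.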
Prompt engineering is a challenging and important task due to the high sensitivity of Large Language Models (LLMs) to the given prompt and the inherent ambiguity of a textual task instruction. Automatic prompt engineering is essential to achieve optimized performance from LLMs. Recent studies have demonstrated the capabilities of LLMs to automatically conduct prompt engineering by employing a meta-prompt that incorporates the outcomes of the last trials and proposes an improved prompt. However, this requires a high-quality benchmark to compare different prompts, which is difficult and expensive to acquire in many real-world use cases. In this work, we introduce a new method for automatic prompt engineering, using a calibration process that iteratively refines the prompt to the user intent. During the optimization process, the system jointly generates synthetic data of boundary use cases and optimizes the prompt according to the generated dataset. We demonstrate the effectiveness of our method with respect to strong proprietary models on real-world tasks such as moderation and generation. Our method outperforms state-of-the-art methods with a limited number of annotated samples. Furthermore, we validate the advantages of each one of the system's key components. Our system is built in a modular way, facilitating easy adaptation to other tasks. The code is available $\href{//github.com/Eladlev/AutoPrompt}{here}$.
Investigators often use multi-source data (e.g., multi-center trials, meta-analyses of randomized trials, pooled analyses of observational cohorts) to learn about the effects of interventions in subgroups of some well-defined target population. Such a target population can correspond to one of the data sources of the multi-source data or an external population in which the treatment and outcome information may not be available. We develop and evaluate methods for using multi-source data to estimate subgroup potential outcome means and treatment effects in a target population. We consider identifiability conditions and propose doubly robust estimators that, under mild conditions, are non-parametrically efficient and allow for nuisance functions to be estimated using flexible data-adaptive methods (e.g., machine learning techniques). We also show how to construct confidence intervals and simultaneous confidence bands for the estimated subgroup treatment effects. We examine the properties of the proposed estimators in simulation studies and compare performance against alternative estimators. We also conclude that our methods work well when the sample size of the target population is much larger than the sample size of the multi-source data. We illustrate the proposed methods in a meta-analysis of randomized trials for schizophrenia.
As humans advance toward a higher level of artificial intelligence, it is always at the cost of escalating computational resource consumption, which requires developing novel solutions to meet the exponential growth of AI computing demand. Neuromorphic hardware takes inspiration from how the brain processes information and promises energy-efficient computing of AI workloads. Despite its potential, neuromorphic hardware has not found its way into commercial AI data centers. In this article, we try to analyze the underlying reasons for this and derive requirements and guidelines to promote neuromorphic systems for efficient and sustainable cloud computing: We first review currently available neuromorphic hardware systems and collect examples where neuromorphic solutions outperform conventional AI processing on CPUs and GPUs. Next, we identify applications, models and algorithms which are commonly deployed in AI data centers as further directions for neuromorphic algorithms research. Last, we derive requirements and best practices for the hardware and software integration of neuromorphic systems into data centers. With this article, we hope to increase awareness of the challenges of integrating neuromorphic hardware into data centers and to guide the community to enable sustainable and energy-efficient AI at scale.
A Gaussian process is proposed as a model for the posterior distribution of the local predictive ability of a model or expert, conditional on a vector of covariates, from historical predictions in the form of log predictive scores. Assuming Gaussian expert predictions and a Gaussian data generating process, a linear transformation of the predictive score follows a noncentral chi-squared distribution with one degree of freedom. Motivated by this we develop a non-central chi-squared Gaussian process regression to flexibly model local predictive ability, with the posterior distribution of the latent GP function and kernel hyperparameters sampled by Hamiltonian Monte Carlo. We show that a cube-root transformation of the log scores is approximately Gaussian with homoscedastic variance, which makes it possible to estimate the model much faster by marginalizing the latent GP function analytically. Linear pools based on learned local predictive ability are applied to predict daily bike usage in Washington DC.
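The noncentral chi-squared claim can be checked directly in the scalar case. The symbols below are assumptions introduced for illustration and do not come from the abstract: the expert predicts $N(m, s^2)$, the data are generated as $y \sim N(\mu, \sigma^2)$, $\ell(y)$ is the log predictive score, and $\chi^2_1(\lambda)$ denotes the noncentral chi-squared distribution with one degree of freedom and noncentrality $\lambda$.

```latex
\begin{align*}
\ell(y) &= \log \mathcal{N}(y \mid m, s^2)
         = -\tfrac{1}{2}\log(2\pi s^2) - \frac{(y-m)^2}{2 s^2}, \\
\frac{y-m}{\sigma} &\sim \mathcal{N}\!\left(\frac{\mu-m}{\sigma},\, 1\right)
  \;\Longrightarrow\;
  \frac{(y-m)^2}{\sigma^2} \sim \chi^2_1(\lambda),
  \qquad \lambda = \frac{(\mu-m)^2}{\sigma^2}, \\
-\frac{2 s^2}{\sigma^2}\left[\ell(y) + \tfrac{1}{2}\log(2\pi s^2)\right]
  &= \frac{(y-m)^2}{\sigma^2} \sim \chi^2_1(\lambda).
\end{align*}
```

The last line is the linear transformation of the log score: shift by $\tfrac{1}{2}\log(2\pi s^2)$, then scale by $-2s^2/\sigma^2$.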
Automatic Speech Recognition (ASR) systems are used in the financial domain to enhance the caller experience by enabling natural language understanding and facilitating efficient and intuitive interactions. Increasing use of ASR systems requires that such systems exhibit very low error rates. The predominant ASR models used to collect numeric data are large, general-purpose commercial models, such as Google Speech-to-text (STT) or Amazon Transcribe, or open-source models such as OpenAI's Whisper. Such ASR models are trained on hundreds of thousands of hours of audio data and require considerable resources to run. Despite recent progress in large speech recognition models, we highlight the potential of smaller, specialized "micro" models. Such light models can be trained to perform well on number-recognition tasks, competing with general models like Whisper or Google STT while using less than 80 minutes of training time and occupying at least an order of magnitude less memory. Also, unlike larger speech recognition models, micro-models are trained on carefully selected and curated datasets, which makes them highly accurate, agile, and easy to retrain, while using low compute resources. We present our work on creating micro models for multi-digit number recognition that handle diverse speaking styles reflecting real-world pronunciation patterns. Our work contributes to domain-specific ASR models, improving digit recognition accuracy and data privacy. As an added advantage, their low resource consumption allows them to be hosted on-premise, keeping private data local instead of uploading it to an external cloud. Our results indicate that our micro-model makes fewer errors than the best-of-breed commercial or open-source ASRs in recognizing digits (1.8% error rate for our best micro-model versus 5.8% for Whisper) and has a low memory footprint (0.66 GB VRAM for our model versus 11 GB VRAM for Whisper).
This study introduces a two-scale Graph Neural Operator (GNO), namely, LatticeGraphNet (LGN), designed as a surrogate model for costly nonlinear finite-element simulations of three-dimensional latticed parts and structures. LGN has two networks: LGN-i, learning the reduced dynamics of lattices, and LGN-ii, learning the mapping from the reduced representation onto the tetrahedral mesh. LGN can predict deformation for arbitrary lattices, hence the name operator. Our approach significantly reduces inference time while maintaining high accuracy for unseen simulations, establishing the use of GNOs as efficient surrogate models for evaluating mechanical responses of lattices and structures.
This paper develops a flexible and computationally efficient multivariate volatility model, which allows for dynamic conditional correlations and volatility spillover effects among financial assets. The new model has desirable properties such as identifiability and computational tractability for many assets. A sufficient condition for strict stationarity is derived for the new process. Two quasi-maximum likelihood estimation methods are proposed for the new model, with and without low-rank constraints on the coefficient matrices respectively, and the asymptotic properties of both estimators are established. Moreover, a Bayesian information criterion with selection consistency is developed for order selection, and testing for volatility spillover effects is carefully discussed. The finite sample performance of the proposed methods is evaluated in simulation studies for small and moderate dimensions. The usefulness of the new model and its inference tools is illustrated by two empirical examples for 5 stock markets and 17 industry portfolios, respectively.
Revealing hidden dynamics from stochastic data is a challenging problem because randomness takes part in the evolution of the data. The problem becomes exceedingly complex when trajectories of the stochastic data are unavailable, as is the case in many scenarios. Here we present an approach to effectively modeling the dynamics of stochastic data without trajectories based on the weak form of the Fokker-Planck (FP) equation, which governs the evolution of the density function in the Brownian process. Taking collocations of Gaussian functions as the test functions in the weak form of the FP equation, we transfer the derivatives to the Gaussian functions and thus approximate the weak form by sample averages over the data. With a dictionary representation of the unknown terms, a linear system is built and then solved by regression, revealing the unknown dynamics of the data. Hence, we name the method Weak Collocation Regression (WCR) after its three key components: weak form, collocation of Gaussian kernels, and regression. The numerical experiments show that our method is flexible and fast: it reveals the dynamics within seconds in multi-dimensional problems and can be easily extended to high-dimensional data such as 20 dimensions. WCR can also correctly identify the hidden dynamics of complex tasks with variable-dependent diffusion and coupled drift, and its performance is robust, achieving high accuracy when noise is added.
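For intuition, the weak form can be written out in one dimension. The notation is an assumption for illustration and does not come from the abstract: drift $\mu(x)$, squared diffusion $D(x) = \sigma(x)^2$, density $p(x,t)$, a Gaussian test function $\psi$, dictionary functions $\phi_k$ with unknown coefficients $a_k$ and $b_k$, and $N$ unpaired samples $x_t^{(n)}$ observed at time $t$.

```latex
\begin{align*}
\partial_t p &= -\partial_x(\mu\, p) + \tfrac{1}{2}\,\partial_x^2 (D\, p), \\
\frac{d}{dt}\int \psi\, p \, dx
  &= \int \left(\mu\, \psi' + \tfrac{1}{2} D\, \psi''\right) p \, dx
   = \mathbb{E}\!\left[\mu(X_t)\,\psi'(X_t) + \tfrac{1}{2} D(X_t)\,\psi''(X_t)\right], \\
\mu &= \sum_k a_k \phi_k, \qquad D = \sum_k b_k \phi_k, \\
\frac{d}{dt}\,\frac{1}{N}\sum_{n=1}^{N} \psi\big(x_t^{(n)}\big)
  &\approx \sum_k a_k \,\frac{1}{N}\sum_{n=1}^{N} \phi_k\big(x_t^{(n)}\big)\,\psi'\big(x_t^{(n)}\big)
   + \frac{1}{2}\sum_k b_k \,\frac{1}{N}\sum_{n=1}^{N} \phi_k\big(x_t^{(n)}\big)\,\psi''\big(x_t^{(n)}\big).
\end{align*}
```

The second line follows from integration by parts, with boundary terms vanishing because $\psi$ decays. Each pair of test function and time yields one equation that is linear in $(a_k, b_k)$. The time derivative on the left can be approximated from snapshots at neighbouring times, for example by finite differences (an assumption of this sketch), so no sample needs to be tracked across time.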