Evaluations of model editing currently rely only on the `next few token' completions that follow a prompt. As a result, the impact of these methods on longer natural language generation is largely unknown. We introduce long-form evaluation of model editing (\textbf{\textit{LEME}}), a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings. Our protocol consists of a machine-rated survey and a classifier that correlates well with human ratings. Importantly, we find that our protocol correlates only weakly with previous short-form metrics (despite being designed to extend efficacy, generalization, locality, and portability into a long-form setting), indicating that it introduces a novel set of dimensions for understanding model editing methods. Using this protocol, we benchmark a number of model editing techniques and present several findings, including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods. Finally, we present a qualitative analysis that illustrates common failure modes in long-form generative settings, including internal consistency, lexical cohesion, and locality issues.
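To make the survey component concrete, here is a minimal sketch of a machine-rated survey; the dimension names, the rubric wording, and the `rate_fn' rater hook are illustrative assumptions, not the actual LEME prompts:

\begin{verbatim}
# Minimal sketch of a LEME-style machine-rated survey.
# SURVEY_DIMENSIONS, the rubric wording, and `rate_fn` are
# illustrative assumptions, not the paper's actual prompts.
SURVEY_DIMENSIONS = ["edit_consistency", "internal_consistency",
                     "lexical_cohesion", "locality"]

def build_survey_prompt(edited_fact, generation, dimension):
    return (f"Edited fact: {edited_fact}\n"
            f"Model output: {generation}\n"
            f"On a scale of 1-5, rate the output's {dimension}. "
            f"Answer with a single number.")

def machine_rated_survey(edited_fact, generation, rate_fn):
    """Collect one 1-5 rating per dimension and add the mean score."""
    scores = {d: float(rate_fn(build_survey_prompt(edited_fact,
                                                   generation, d)))
              for d in SURVEY_DIMENSIONS}
    scores["overall"] = sum(scores.values()) / len(SURVEY_DIMENSIONS)
    return scores

# Trivial stand-in rater that always answers "4".
print(machine_rated_survey("The Eiffel Tower is in Rome.",
                           "The Eiffel Tower, located in Rome, ...",
                           rate_fn=lambda prompt: "4"))
\end{verbatim}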
Fractional dissipation is a powerful tool to study non-local physical phenomena such as damping models. The design of geometric, in particular variational, integrators for the numerical simulation of such systems relies on a variational formulation of the model. In [19], a new approach is proposed to deal with dissipative systems, including fractionally damped systems, in a variational way in both the continuous and the discrete setting. It is based on the doubling of variables and their fractional derivatives. The aim of this work is to derive higher-order fractional variational integrators by means of convolution quadrature (CQ) based on backward difference formulas. We thereby obtain numerical methods of order 2, improving on a previous result in [19]. The convergence properties of the fractional variational integrators and saturation effects due to the approximation of the fractional derivatives by CQ are studied numerically.
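As a rough illustration of such a convolution quadrature, the sketch below computes the CQ weights for BDF2 via the classical J.C.P. Miller power-series recurrence and applies them to a sampled function. It is a minimal numpy example, not the variational integrators of this work; in particular, correction terms for non-vanishing initial data are omitted:

\begin{verbatim}
import numpy as np

def bdf2_cq_weights(alpha, N):
    """CQ weights w_0..w_N with sum_j w_j zeta^j = delta(zeta)^alpha,
    where delta(zeta) = 3/2 - 2*zeta + zeta**2/2 is the BDF2 symbol,
    computed via the J.C.P. Miller power-series recurrence."""
    a = np.array([1.5, -2.0, 0.5])          # coefficients of delta
    w = np.zeros(N + 1)
    w[0] = a[0] ** alpha
    for n in range(1, N + 1):
        s = 0.0
        for k in range(1, min(n, 2) + 1):
            s += (k * (alpha + 1) - n) * a[k] * w[n - k]
        w[n] = s / (n * a[0])
    return w

def fractional_derivative_cq(u, h, alpha):
    """Approximate the order-alpha fractional derivative of samples
    u[0..N] on a uniform grid with step h (Riemann-Liouville type)."""
    w = bdf2_cq_weights(alpha, len(u) - 1)
    return np.array([np.dot(w[:n + 1], u[n::-1])
                     for n in range(len(u))]) / h**alpha

# Sanity check on u(t) = t:  D^0.5 t = 2*sqrt(t/pi).
h, alpha = 1e-3, 0.5
t = np.arange(0, 1 + h, h)
approx = fractional_derivative_cq(t, h, alpha)
print(abs(approx[-1] - 2 * np.sqrt(t[-1] / np.pi)))  # small
\end{verbatim}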
The present article aims to design and analyze efficient first-order strong schemes for a generalized A\"{i}t-Sahalia type model arising in mathematical finance and evolving in a positive domain $(0, \infty)$, which possesses a diffusion term with superlinear growth and a highly nonlinear drift that blows up at the origin. Such a complicated structure of the model unavoidably causes essential difficulties in the construction and convergence analysis of time discretizations. By incorporating implicitness in the term $\alpha_{-1} x^{-1}$ and a corrective mapping $\Phi_h$ in the recursion, we develop a novel class of explicit and unconditionally positivity-preserving (i.e., for any step-size $h>0$) Milstein-type schemes for the underlying model. In both non-critical and general critical cases, we introduce a new approach to analyze mean-square error bounds of the proposed schemes, without relying on a priori high-order moment bounds of the numerical approximations. The expected order-one mean-square convergence is attained. This theoretical guarantee can be used to justify the optimal complexity of the multilevel Monte Carlo method. Numerical experiments are finally provided to verify the theoretical findings.
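The abstract does not spell out the scheme, but one plausible realization of a step that is implicit only in the $\alpha_{-1} x^{-1}$ term is sketched below; the taming map standing in for the corrective mapping $\Phi_h$ is an assumption for illustration and may differ from the one defined in the article. Solving the implicit relation reduces to a scalar quadratic whose positive root exists for every $h>0$, which is what yields unconditional positivity:

\begin{verbatim}
import numpy as np

def milstein_pos_step(y, h, dW, a_m1, a0, a1, a2, r, sigma, rho):
    """One step of a positivity-preserving Milstein-type scheme for
    dX = (a_m1/X - a0 + a1*X - a2*X**r) dt + sigma*X**rho dW.
    Only the a_m1/X term is implicit; the resulting quadratic has a
    root that is positive for any h > 0.  The taming map phi is a
    guess standing in for the paper's corrective mapping Phi_h."""
    phi = lambda z: z / (1.0 + h * abs(z))       # assumed corrective map
    explicit = ((-a0 + a1 * y - a2 * y**r) * h
                + sigma * y**rho * dW
                + 0.5 * sigma**2 * rho * y**(2 * rho - 1) * (dW**2 - h))
    b = y + phi(explicit)
    return 0.5 * (b + np.sqrt(b**2 + 4.0 * a_m1 * h))  # positive root

# Simulate one path; parameter values are illustrative only.
rng = np.random.default_rng(0)
h, y = 1e-3, 1.0
for _ in range(1000):
    y = milstein_pos_step(y, h, np.sqrt(h) * rng.standard_normal(),
                          a_m1=1.0, a0=1.0, a1=1.0, a2=1.0, r=2.0,
                          sigma=0.5, rho=1.5)
print(y)  # stays positive by construction
\end{verbatim}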
This paper presents a Bayesian regression model relating scalar outcomes to brain functional connectivity represented as symmetric positive definite (SPD) matrices. Unlike many proposals that simply vectorize the matrix-valued connectivity predictors, thereby ignoring their geometric structure, the method presented here respects the Riemannian geometry of SPD matrices by using a tangent space modeling. Dimension reduction is performed in the tangent space, relating the resulting low-dimensional representations to the responses. The dimension reduction matrix is learned in a supervised manner with a sparsity-inducing prior imposed on a Stiefel manifold to prevent overfitting. Our method yields a parsimonious regression model that allows uncertainty quantification of all model parameters and identification of key brain regions that predict the outcomes. We demonstrate the performance of our approach in simulation settings and through a case study to predict Picture Vocabulary scores using data from the Human Connectome Project.
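A minimal sketch of the tangent-space step (affine-invariant geometry) is given below; the supervised Stiefel-manifold dimension reduction and the sparsity-inducing prior are beyond this snippet, and using the Euclidean mean as the reference point is a simplification (a Fréchet mean is more common):

\begin{verbatim}
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def tangent_vectors(spd_mats, ref):
    """Project SPD matrices onto the tangent space at `ref`
    (affine-invariant metric) and vectorize the upper triangle,
    scaling off-diagonal entries by sqrt(2) to preserve norms."""
    p = ref.shape[0]
    w = fractional_matrix_power(ref, -0.5)       # whitening matrix
    iu = np.triu_indices(p)
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    vecs = []
    for s in spd_mats:
        t = logm(w @ s @ w)                      # symmetric tangent matrix
        vecs.append(np.real(t[iu]) * scale)
    return np.array(vecs)                        # shape (n, p*(p+1)/2)

# Toy usage: random SPD "connectivity" matrices, Euclidean-mean reference.
rng = np.random.default_rng(1)
mats = [(lambda a: a @ a.T + np.eye(4))(rng.standard_normal((4, 4)))
        for _ in range(10)]
X = tangent_vectors(mats, ref=np.mean(mats, axis=0))
print(X.shape)  # (10, 10)
\end{verbatim}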
The present work is devoted to strong approximations of a generalized A\"{i}t-Sahalia model arising from mathematical finance. The numerical study of the considered model faces essential difficulties caused by a drift that blows up at the origin, highly nonlinear drift and diffusion coefficients, and the requirement to preserve positivity. In this paper, a novel explicit Euler-type scheme is proposed, which is easily implementable and preserves the positivity of the original model unconditionally, i.e., for any time step-size $h >0$. A mean-square convergence rate of order $0.5$ is also obtained for the proposed scheme in both non-critical and general critical cases. Our work is motivated by the need to justify multilevel Monte Carlo (MLMC) simulations for the underlying model, where the rate of mean-square convergence is required and the preservation of positivity is desirable, particularly for large discretization time steps. Numerical experiments are finally provided to confirm the theoretical findings.
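For context, the MLMC estimator that motivates this work telescopes over discretization levels and couples fine and coarse paths through shared Brownian increments. A generic sketch follows, with the one-step scheme passed in as a black box; the GBM step in the usage example is illustrative only and not the scheme proposed here:

\begin{verbatim}
import numpy as np

def mlmc_estimate(step_fn, payoff, y0, T, L, N, rng):
    """Multilevel Monte Carlo estimate of E[payoff(X_T)].
    Level l uses 2**l steps; fine and coarse paths on each level are
    coupled by summing the fine Brownian increments pairwise.
    `step_fn(y, h, dW)` is any one-step scheme, e.g. a
    positivity-preserving Euler method as in the text."""
    est = 0.0
    for l in range(L + 1):
        nl, acc = 2**l, 0.0
        h_f = T / nl
        for _ in range(N[l]):
            dW = np.sqrt(h_f) * rng.standard_normal(nl)
            yf = y0
            for k in range(nl):
                yf = step_fn(yf, h_f, dW[k])
            if l == 0:
                acc += payoff(yf)
            else:
                yc, h_c = y0, 2 * h_f
                for k in range(0, nl, 2):
                    yc = step_fn(yc, h_c, dW[k] + dW[k + 1])
                acc += payoff(yf) - payoff(yc)
        est += acc / N[l]
    return est

# Toy usage with a geometric Brownian motion Euler step.
rng = np.random.default_rng(2)
gbm_step = lambda y, h, dW: y * (1.0 + 0.05 * h + 0.2 * dW)
print(mlmc_estimate(gbm_step, payoff=lambda x: max(x - 1.0, 0.0),
                    y0=1.0, T=1.0, L=5,
                    N=[4000 // 2**l + 100 for l in range(6)], rng=rng))
\end{verbatim}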
In this work we present denoiSplit, a method to tackle a new analysis task: joint semantic image splitting and unsupervised denoising. This dual task has important applications in fluorescence microscopy, where image splitting is a key analysis step but noise generally hinders the downstream analysis of image content. Image splitting involves dissecting an image into its distinguishable semantic structures. We show that the current state-of-the-art method for this task struggles in the presence of image noise, inadvertently also distributing the noise across the predicted outputs. The method we present here can deal with image noise by integrating an unsupervised denoising sub-task. This integration results in improved semantic image unmixing, even in the presence of notable and realistic levels of imaging noise. A key innovation in denoiSplit is the use of specifically formulated noise models and a suitable adjustment of the KL-divergence loss for the high-dimensional hierarchical latent space being trained. We showcase the performance of denoiSplit across 4 tasks on real-world microscopy images. Additionally, we perform qualitative and quantitative evaluations and compare results to existing benchmarks, demonstrating the effectiveness of denoiSplit: a single Variational Splitting Encoder-Decoder (VSE) network using two suitable noise models to jointly perform semantic splitting and denoising.
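The following sketch shows the shape such a training objective could take: per-channel noise-model likelihoods plus a KL term summed over hierarchy levels. The fixed-sigma Gaussian noise model and the per-dimension KL normalization are guesses standing in for the specifically formulated noise models and the KL adjustment mentioned above:

\begin{verbatim}
import numpy as np

def gaussian_nll(noisy_target, pred_clean, sigma):
    """Noise-model likelihood: noisy pixel ~ N(clean signal, sigma^2).
    Real microscopy noise models are signal-dependent; a fixed-sigma
    Gaussian keeps the sketch short."""
    return 0.5 * np.mean(((noisy_target - pred_clean) / sigma) ** 2
                         + np.log(2 * np.pi * sigma**2))

def hierarchical_kl(mus, logvars):
    """KL(q || N(0,I)) summed over hierarchy levels, normalized per
    latent dimension -- a guess at the 'suitable adjustment' above."""
    total = 0.0
    for mu, logvar in zip(mus, logvars):
        kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
        total += kl / mu.size
    return total

def denoisplit_like_loss(noisy_c1, noisy_c2, pred_c1, pred_c2,
                         mus, logvars, sigma1, sigma2, beta=1.0):
    return (gaussian_nll(noisy_c1, pred_c1, sigma1)
            + gaussian_nll(noisy_c2, pred_c2, sigma2)
            + beta * hierarchical_kl(mus, logvars))

# Toy shapes: two 8x8 channels, three latent levels of 16 dims each.
rng = np.random.default_rng(6)
c1, c2 = rng.random((8, 8)), rng.random((8, 8))
mus = [rng.standard_normal(16) for _ in range(3)]
logvars = [np.zeros(16) for _ in range(3)]
print(denoisplit_like_loss(c1, c2, c1, c2, mus, logvars, 0.1, 0.1))
\end{verbatim}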
This paper studies optimal hypothesis testing for nonregular statistical models with parameter-dependent support. We consider both one-sided and two-sided hypothesis testing and develop asymptotically uniformly most powerful tests based on the likelihood ratio process. The proposed one-sided test involves randomization to achieve asymptotic size control, a tuning constant to avoid discontinuities in the limiting likelihood ratio process, and a user-specified value under the alternative hypothesis to achieve asymptotic optimality. Our two-sided test is asymptotically uniformly most powerful without imposing further restrictions such as unbiasedness. Simulation results illustrate the desirable power properties of the proposed tests.
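For intuition, the canonical finite-sample instance of such a nonregular model is $X_1,\dots,X_n \sim U(0,\theta)$, whose support depends on $\theta$. The classical UMP test below is a concrete, non-asymptotic analogue of the constructions in the paper, not the paper's own procedure:

\begin{verbatim}
import numpy as np

def uniform_ump_test(x, theta0, alpha):
    """UMP size-alpha test of H0: theta = theta0 vs H1: theta > theta0
    for X_i ~ U(0, theta): reject if max(x) > theta0 or
    max(x) <= theta0 * alpha**(1/n)  (Lehmann's classical example of
    testing in a model with parameter-dependent support)."""
    n, m = len(x), np.max(x)
    return m > theta0 or m <= theta0 * alpha ** (1.0 / n)

# Monte Carlo size and power check.
rng = np.random.default_rng(3)
n, theta0, alpha, reps = 20, 1.0, 0.05, 100_000
size = np.mean([uniform_ump_test(rng.uniform(0, theta0, n), theta0, alpha)
                for _ in range(reps)])
power = np.mean([uniform_ump_test(rng.uniform(0, 1.05 * theta0, n),
                                  theta0, alpha) for _ in range(reps)])
print(size, power)   # size ~ 0.05; power well above alpha
\end{verbatim}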
This work presents a maturity model for assessing catalogues of semantic artefacts, one of the keystones that permit semantic interoperability of systems. We defined the dimensions and related features to include in the maturity model by analysing the current literature and existing catalogues of semantic artefacts provided by experts. In addition, we assessed 26 different catalogues to demonstrate the effectiveness of the maturity model, which includes 12 dimensions (Metadata, Openness, Quality, Availability, Statistics, PID, Governance, Community, Sustainability, Technology, Transparency, and Assessment) and 43 related features (or sub-criteria) associated with these dimensions. This maturity model is one of the first attempts to provide recommendations for the governance and processes needed to preserve and maintain semantic artefacts, and it helps to assess and address interoperability challenges.
A goodness-of-fit test for the Functional Linear Model with Scalar Response (FLMSR) with responses Missing at Random (MAR) is proposed in this paper. The test statistic relies on a marked empirical process indexed by the projected functional covariate, and its distribution under the null hypothesis is calibrated using a wild bootstrap procedure. The computation and performance of the test rely on having an accurate estimator of the functional slope of the FLMSR when the sample has MAR responses. Three estimation methods based on the Functional Principal Components (FPCs) of the covariate are considered. First, the simplified method estimates the functional slope by simply discarding observations with missing responses. Second, the imputed method estimates the functional slope by imputing the missing responses using the simplified estimator. Third, the inverse probability weighted method incorporates the missing response generation mechanism when imputing. Furthermore, both cross-validation and LASSO regression are used to select the FPCs employed by each estimator. Several Monte Carlo experiments are conducted to analyze the behavior of the testing procedure in combination with the functional slope estimators. Results indicate that estimators performing missing-response imputation achieve the highest power. The testing procedure is applied to check for linear dependence between the average number of sunny days per year and the mean curve of daily temperatures at weather stations in Spain.
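As a small illustration of the first of these estimators, the sketch below computes the simplified FPC-based slope estimate by discarding curves with missing responses; the common grid, the fixed number of FPCs, and the quadrature convention are illustrative choices, and the imputed and inverse probability weighted variants build on this step in the way described above:

\begin{verbatim}
import numpy as np

def fpc_slope_simplified(X, y, k):
    """'Simplified' estimator of the FLMSR slope: drop curves whose
    response is missing, run FPCA on the remaining curves (SVD of the
    centered data matrix), and regress y on the first k FPC scores.
    X: (n, p) curves on a common grid; y: responses with np.nan
    marking MAR values."""
    obs = ~np.isnan(y)
    Xc = X[obs] - X[obs].mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                       # FPC scores
    coef, *_ = np.linalg.lstsq(
        np.column_stack([np.ones(obs.sum()), scores]), y[obs], rcond=None)
    slope_curve = Vt[:k].T @ coef[1:]            # estimated slope on grid
    return coef[0], slope_curve

# Toy usage on synthetic curves with 20% MAR responses.
rng = np.random.default_rng(4)
n, p = 100, 50
X = np.cumsum(rng.standard_normal((n, p)), axis=1)
beta = np.sin(np.linspace(0, np.pi, p))
y = X @ beta / p + 0.1 * rng.standard_normal(n)
y[rng.random(n) < 0.2] = np.nan
print(fpc_slope_simplified(X, y, k=4)[1].shape)  # (50,)
\end{verbatim}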
We develop a new rank-based approach for univariate two-sample testing in the presence of missing data which makes no assumptions about the missingness mechanism. This approach is a theoretical extension of the Wilcoxon-Mann-Whitney test that controls the Type I error by providing exact bounds for the test statistic after accounting for the number of missing values. Greater statistical power is shown when the method is extended to account for a bounded domain. Furthermore, exact bounds are provided on the proportions of data that can be missing in the two samples while yielding a significant result. Simulations demonstrate that our method has good power, typically for cases of $10\%$ to $20\%$ missing data, while standard imputation approaches fail to control the Type I error. We illustrate our method on complex clinical trial data in which patients' withdrawal from the trial leads to missing values.
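A minimal sketch of the worst-case idea follows: with a bounded domain, every missing value can be imputed at the domain's endpoints to bracket the one-sided Mann-Whitney p-value, and reporting significance only when the worst case is significant controls the Type I error under any missingness mechanism. The paper bounds the statistic itself exactly; recomputing the test on extreme imputations, as done here, is a simple stand-in:

\begin{verbatim}
import numpy as np
from scipy.stats import mannwhitneyu

def wmw_pvalue_bounds(x, y, n_miss_x, n_miss_y, lo, hi):
    """Worst/best-case p-values for the one-sided Mann-Whitney test of
    'X tends to be larger than Y' with missing values: place every
    missing X at the bottom of the bounded domain [lo, hi] and every
    missing Y at the top, and vice versa."""
    x_w = np.concatenate([x, np.full(n_miss_x, lo)])
    y_w = np.concatenate([y, np.full(n_miss_y, hi)])
    x_b = np.concatenate([x, np.full(n_miss_x, hi)])
    y_b = np.concatenate([y, np.full(n_miss_y, lo)])
    p_worst = mannwhitneyu(x_w, y_w, alternative="greater").pvalue
    p_best = mannwhitneyu(x_b, y_b, alternative="greater").pvalue
    return p_worst, p_best

# Toy usage: strong separation survives two missing values per group.
rng = np.random.default_rng(5)
x = rng.normal(1.0, 1.0, 40)
y = rng.normal(0.0, 1.0, 40)
print(wmw_pvalue_bounds(x, y, n_miss_x=2, n_miss_y=2, lo=-5.0, hi=5.0))
\end{verbatim}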